[WACV'26] Multimodal Adversarial Training β€” Resources

This repository hosts model checkpoints and data resources for the paper:

Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026

For the source code and training scripts, please refer to the GitHub repository: πŸ‘‰ https://github.com/CyberAgentAILab/multimodal-adversarial-training

πŸ“˜ Overview

This work proposes Multimodal Adversarial Training (MAT), a unified adversarial training pipeline for Vision-Language Models (VLMs) on image-text retrieval. The extended version, MAT+, additionally leverages one-to-many relationships in image-text pairs to improve robustness.

Highlights

  • Unified MAT pipeline for image-text retrieval models (CLIP, ALBEF, BLIP).
  • MAT+ leverages one-to-many relationships in image-text pairs.
  • Reproducible results on Flickr30k and COCO benchmarks.

πŸ“˜ Directory structure

resources/
β”œβ”€β”€ checkpoints/                          # MAT/MAT+ model checkpoints
β”‚     β”œβ”€β”€ ALBEF_flickr_MAT_HumanCaps.pth
β”‚     β”œβ”€β”€ BLIP_flickr_MAT_HumanCaps.pth
β”‚     β”œβ”€β”€ CLIP_B_coco_MAT_HumanCaps.pth
β”‚     β”œβ”€β”€ CLIP_B_coco_MAT_base.pth
β”‚     └── CLIP_B_flickr_MAT_HumanCaps.pth
└── augmentations/                        # Data augmentations for MAT+
      β”œβ”€β”€ dataset_json.zip                # Text augmentation annotations
      └── flickr_SD_I2I_0.5.zip          # Image augmentations (SD img2img)

πŸ“˜ Checkpoints

Adversarially trained model checkpoints for image-text retrieval:

| File | Model | Dataset | Variant |
|---|---|---|---|
| ALBEF_flickr_MAT_HumanCaps.pth | ALBEF | Flickr30k | MAT + HumanCaps |
| BLIP_flickr_MAT_HumanCaps.pth | BLIP | Flickr30k | MAT + HumanCaps |
| CLIP_B_coco_MAT_HumanCaps.pth | CLIP ViT-B | COCO | MAT + HumanCaps |
| CLIP_B_coco_MAT_base.pth | CLIP ViT-B | COCO | MAT (base) |
| CLIP_B_flickr_MAT_HumanCaps.pth | CLIP ViT-B | Flickr30k | MAT + HumanCaps |
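The checkpoint filenames follow the pattern `<MODEL>_<dataset>_MAT_<variant>.pth` shown above. As a small convenience sketch (the helper name is ours, not part of the official code), the fields can be recovered programmatically:

```python
def parse_checkpoint_name(filename: str):
    """Split a checkpoint filename into (model, dataset, variant).

    Filenames follow <MODEL>_<dataset>_MAT_<variant>.pth, where
    <MODEL> itself may contain underscores (e.g. CLIP_B).
    """
    stem = filename.removesuffix(".pth")
    left, variant = stem.split("_MAT_")      # e.g. "CLIP_B_coco", "base"
    *model_parts, dataset = left.split("_")  # model is everything but the last token
    return "_".join(model_parts), dataset, variant

print(parse_checkpoint_name("CLIP_B_coco_MAT_base.pth"))
# -> ('CLIP_B', 'coco', 'base')
```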

The base models used for training are:

πŸ“˜ Augmentations

Data augmentations used to reproduce MAT+ results:

| File | Description |
|---|---|
| dataset_json.zip | Text augmentation data: augmented captions and annotations in JSON format |
| flickr_SD_I2I_0.5.zip | Image augmentation data: Flickr30k images augmented via Stable Diffusion image-to-image (strength 0.5) |
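The exact annotation schema inside dataset_json.zip is defined by the code repository. As a rough illustration of the one-to-many idea MAT+ relies on (the keys `image`, `caption`, and `augmented_captions` below are hypothetical, not the actual schema), each image can be expanded into multiple (image, caption) training pairs:

```python
import json

# Hypothetical annotation entries: each image is paired with its original
# caption plus augmented captions (the one-to-many relationship MAT+ uses).
annotations = json.loads("""[
  {"image": "flickr30k/1000092795.jpg",
   "caption": "Two young men with shaggy hair look at their hands.",
   "augmented_captions": ["Two men with messy hair examine their hands."]}
]""")

# Expand each entry into (image, caption) training pairs.
pairs = [(a["image"], c)
         for a in annotations
         for c in [a["caption"], *a["augmented_captions"]]]

print(len(pairs))  # one original + one augmented caption -> 2 pairs
```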

πŸ“˜ Usage

  1. Clone or download this repository:

    # Using the Hugging Face CLI
    hf download cyberagent/multimodal-adversarial-training --local-dir ./resources
    
    # Or using git with LFS
    git lfs install
    git clone https://huggingface.co/cyberagent/multimodal-adversarial-training
    
  2. Clone the code repository and follow its setup instructions.

  3. Update the checkpoint and data paths in configs/ to point to the downloaded resources.
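Once the paths are set, the training/evaluation scripts in the code repository handle checkpoint loading. If you want to inspect a checkpoint directly, a minimal sketch might look like the following (the "model" nesting and the "module." prefix handling are common PyTorch conventions, not guarantees about these specific files):

```python
import torch

def load_mat_checkpoint(path: str):
    """Load a MAT/MAT+ checkpoint state dict onto CPU.

    Some checkpoints nest weights under a "model" key, and checkpoints
    saved from DataParallel training prefix keys with "module."; both
    cases are normalized here before returning the state dict.
    """
    state = torch.load(path, map_location="cpu")
    if isinstance(state, dict) and "model" in state:
        state = state["model"]
    return {k.removeprefix("module."): v for k, v in state.items()}
```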

πŸ“˜ Citation

If you find these resources useful, please cite:

@inproceedings{waseda2026multimodal,
  title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
  author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}

πŸ“˜ Acknowledgements

This work builds upon the following repositories:

πŸ“˜ License

This repository is licensed under the GNU General Public License v3.0.
