[WACV'26] Multimodal Adversarial Training: Resources
This repository hosts model checkpoints and data resources for the paper:
Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026
For the source code and training scripts, please refer to the GitHub repository: https://github.com/CyberAgentAILab/multimodal-adversarial-training
Overview
This work proposes Multimodal Adversarial Training (MAT) for Vision-Language Models (VLMs). MAT is a unified adversarial training pipeline for image-text retrieval models. The extended version, MAT+, additionally leverages one-to-many relationships in image-text pairs to improve robustness.
Highlights
- Unified MAT pipeline for image-text retrieval models (CLIP, ALBEF, BLIP).
- MAT+ leverages one-to-many relationships in image-text pairs.
- Reproducible results on Flickr30k and COCO benchmarks.
Directory structure
```
resources/
├── checkpoints/              # MAT/MAT+ model checkpoints
│   ├── ALBEF_flickr_MAT_HumanCaps.pth
│   ├── BLIP_flickr_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_base.pth
│   └── CLIP_B_flickr_MAT_HumanCaps.pth
└── augmentations/            # Data augmentations for MAT+
    ├── dataset_json.zip      # Text augmentation annotations
    └── flickr_SD_I2I_0.5.zip # Image augmentations (SD img2img)
```
Checkpoints
Adversarially trained model checkpoints for image-text retrieval:
| File | Model | Dataset | Variant |
|---|---|---|---|
| `ALBEF_flickr_MAT_HumanCaps.pth` | ALBEF | Flickr30k | MAT + HumanCaps |
| `BLIP_flickr_MAT_HumanCaps.pth` | BLIP | Flickr30k | MAT + HumanCaps |
| `CLIP_B_coco_MAT_HumanCaps.pth` | CLIP ViT-B | COCO | MAT + HumanCaps |
| `CLIP_B_coco_MAT_base.pth` | CLIP ViT-B | COCO | MAT (base) |
| `CLIP_B_flickr_MAT_HumanCaps.pth` | CLIP ViT-B | Flickr30k | MAT + HumanCaps |
The base models used for training are:
- ALBEF: salesforce/ALBEF
- BLIP: salesforce/BLIP
- CLIP: openai/CLIP (ViT-B/16)
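The checkpoint filenames encode the model, training dataset, and variant. As a small illustration (the helper below is hypothetical, not part of the released code), a filename can be split back into those components, assuming the dataset token (`flickr` or `coco`) always separates the model prefix from the variant suffix:

```python
from pathlib import Path

# Dataset tokens appearing in this repo's checkpoint filenames
DATASETS = {"flickr", "coco"}

def parse_checkpoint_name(filename: str) -> dict:
    """Split a checkpoint filename into model, dataset, and variant parts."""
    parts = Path(filename).stem.split("_")
    # the dataset token separates the model prefix from the variant suffix
    idx = next(i for i, p in enumerate(parts) if p in DATASETS)
    return {
        "model": "_".join(parts[:idx]),
        "dataset": parts[idx],
        "variant": "_".join(parts[idx + 1:]),
    }

print(parse_checkpoint_name("CLIP_B_coco_MAT_base.pth"))
# {'model': 'CLIP_B', 'dataset': 'coco', 'variant': 'MAT_base'}
```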
Augmentations
Data augmentations used to reproduce MAT+ results:
| File | Description |
|---|---|
| `dataset_json.zip` | Text augmentation data: augmented captions and annotations in JSON format |
| `flickr_SD_I2I_0.5.zip` | Image augmentation data: Flickr30k images augmented via Stable Diffusion image-to-image (strength 0.5) |
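Once downloaded, the augmentation archives can be inspected before extraction. A minimal sketch using Python's standard `zipfile` module; the example path assumes the repo was downloaded to `./resources`:

```python
import zipfile

def list_archive(path: str, limit: int = 5):
    """Return up to `limit` member names from a zip archive."""
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()[:limit]

# Example (path is an assumption based on the layout above):
# list_archive("resources/augmentations/dataset_json.zip")
```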
Usage
1. Clone or download this repository:

   ```shell
   # Using the Hugging Face CLI
   hf download cyberagent/multimodal-adversarial-training --local-dir ./resources

   # Or using git with LFS
   git lfs install
   git clone https://huggingface.co/cyberagent/multimodal-adversarial-training
   ```

2. Clone the code repository and follow its setup instructions.
3. Update the checkpoint and data paths in `configs/` to point to the downloaded resources.
Citation
If you find these resources useful, please cite:
@inproceedings{waseda2026multimodal,
title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2026}
}
Acknowledgements
This work builds upon the following repositories:
License
This repository is licensed under the GNU General Public License v3.0.