[WACV'26] Multimodal Adversarial Training: Resources
This repository hosts model checkpoints and data resources for the paper:
Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026
For the source code and training scripts, please refer to the GitHub repository: https://github.com/CyberAgentAILab/multimodal-adversarial-training
Overview
This work proposes Multimodal Adversarial Training (MAT) for Vision-Language Models (VLMs). MAT is a unified adversarial training pipeline for image-text retrieval models. The extended version, MAT+, additionally leverages one-to-many relationships in image-text pairs to improve robustness.
Highlights
- Unified MAT pipeline for image-text retrieval models (CLIP, ALBEF, BLIP).
- MAT+ leverages one-to-many relationships in image-text pairs.
- Reproducible results on Flickr30k and COCO benchmarks.
Directory structure
```
resources/
├── checkpoints/              # MAT/MAT+ model checkpoints
│   ├── ALBEF_flickr_MAT_HumanCaps.pth
│   ├── BLIP_flickr_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_HumanCaps.pth
│   ├── CLIP_B_coco_MAT_base.pth
│   └── CLIP_B_flickr_MAT_HumanCaps.pth
└── augmentations/            # Data augmentations for MAT+
    ├── dataset_json.zip      # Text augmentation annotations
    └── flickr_SD_I2I_0.5.zip # Image augmentations (SD img2img)
```
Checkpoints
Adversarially trained model checkpoints for image-text retrieval:
| File | Model | Dataset | Variant |
|---|---|---|---|
| `ALBEF_flickr_MAT_HumanCaps.pth` | ALBEF | Flickr30k | MAT + HumanCaps |
| `BLIP_flickr_MAT_HumanCaps.pth` | BLIP | Flickr30k | MAT + HumanCaps |
| `CLIP_B_coco_MAT_HumanCaps.pth` | CLIP ViT-B | COCO | MAT + HumanCaps |
| `CLIP_B_coco_MAT_base.pth` | CLIP ViT-B | COCO | MAT (base) |
| `CLIP_B_flickr_MAT_HumanCaps.pth` | CLIP ViT-B | Flickr30k | MAT + HumanCaps |
The base models used for training are:
- ALBEF: salesforce/ALBEF
- BLIP: salesforce/BLIP
- CLIP: openai/CLIP (ViT-B/16)
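The checkpoint filenames encode the model, training dataset, and variant. As a small illustration (the helper below is hypothetical, not part of the released code), a filename can be split back into those components, assuming the dataset token (`flickr` or `coco`) always separates the model prefix from the variant suffix:

```python
from pathlib import Path

# Dataset tokens appearing in this repo's checkpoint filenames
DATASETS = {"flickr", "coco"}

def parse_checkpoint_name(filename: str) -> dict:
    """Split a checkpoint filename into model, dataset, and variant parts."""
    parts = Path(filename).stem.split("_")
    # the dataset token separates the model prefix from the variant suffix
    idx = next(i for i, p in enumerate(parts) if p in DATASETS)
    return {
        "model": "_".join(parts[:idx]),
        "dataset": parts[idx],
        "variant": "_".join(parts[idx + 1:]),
    }

print(parse_checkpoint_name("CLIP_B_coco_MAT_base.pth"))
# {'model': 'CLIP_B', 'dataset': 'coco', 'variant': 'MAT_base'}
```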
Augmentations
Data augmentations used to reproduce MAT+ results:
| File | Description |
|---|---|
| `dataset_json.zip` | Text augmentation data: augmented captions and annotations in JSON format |
| `flickr_SD_I2I_0.5.zip` | Image augmentation data: Flickr30k images augmented via Stable Diffusion image-to-image (strength 0.5) |
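Once downloaded, the augmentation archives can be inspected before extraction. A minimal sketch using Python's standard `zipfile` module; the example path assumes the repo was downloaded to `./resources`:

```python
import zipfile

def list_archive(path: str, limit: int = 5):
    """Return up to `limit` member names from a zip archive."""
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()[:limit]

# Example (path is an assumption based on the layout above):
# list_archive("resources/augmentations/dataset_json.zip")
```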
Usage
1. Clone or download this repository:

   ```shell
   # Using the Hugging Face CLI
   hf download cyberagent/multimodal-adversarial-training --local-dir ./resources

   # Or using git with LFS
   git lfs install
   git clone https://huggingface.co/cyberagent/multimodal-adversarial-training
   ```

2. Clone the code repository and follow its setup instructions.
3. Update the checkpoint and data paths in `configs/` to point to the downloaded resources.
Citation
If you find these resources useful, please cite:
@inproceedings{waseda2026multimodal,
title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2026}
}
Acknowledgements
This work builds upon the following repositories:
License
This repository is licensed under the GNU General Public License v3.0.