
Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction

About

With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. Adversarial examples can typically be designed to transfer, attacking not only different models but also diverse tasks. However, existing attacks on vision-language models rely mainly on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. It does so by establishing a contrastive learning mechanism involving adversarial, positive, and negative samples, which reinforces the semantic inconsistency of the resulting perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit attacks on VLP models, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.
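To make the idea of a contrastive objective over adversarial, positive, and negative samples concrete, here is a minimal sketch in plain Python. It is not SADCA's actual loss, which the abstract does not specify. It assumes an InfoNCE-style formulation over embedding vectors, and the function names, the `temperature` parameter, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_attack_loss(adv, positives, negatives, temperature=0.1):
    """Hypothetical InfoNCE-style objective for an attacker to minimize.

    Returns the log of the softmax mass that the adversarial embedding
    assigns to its matching (positive) samples. The value is at most 0,
    and it decreases as the adversarial embedding moves away from the
    positives and toward the negatives.
    """
    pos = [math.exp(cosine(adv, p) / temperature) for p in positives]
    neg = [math.exp(cosine(adv, n) / temperature) for n in negatives]
    return math.log(sum(pos) / (sum(pos) + sum(neg)))


# Toy embeddings (assumed for illustration only).
positives = [[1.0, 0.0]]   # e.g. the caption paired with the image
negatives = [[0.0, 1.0]]   # e.g. an unrelated caption

loss_clean = contrastive_attack_loss([1.0, 0.0], positives, negatives)
loss_shifted = contrastive_attack_loss([0.0, 1.0], positives, negatives)
print(loss_clean > loss_shifted)
```

In this sketch, an embedding aligned with the positive sample scores close to 0, while one aligned with the negative sample scores far lower. An attack built on such an objective would update the perturbation in the direction that lowers it, which involves negatives as well as the positive pair.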

Yuanbo Li, Tianyang Xu, Cong Hu, Tao Zhou, Xiao-Jun Wu, Josef Kittler • 2026

Related benchmarks

Task                 Dataset             Result              Rank
Text Retrieval       Flickr30k (test)    R@1 (ASR): 100      340
Visual Grounding     RefCOCO+ (val)      Accuracy: 46.78     212
Image Retrieval      Flickr30k (test)    --                  210
Visual Grounding     RefCOCO+ (testA)    Accuracy: 52.41     206
Visual Grounding     RefCOCO+ (testB)    Accuracy: 37.1      180
Adversarial Attack   Flickr30K           ASR: 0.8634         48
Text Retrieval       MSCOCO              ASR@R1: 100         33
Image Retrieval      MSCOCO              ASR@R1: 100         13
Image Captioning     MSCOCO              BLEU-4: 17.4        5
