
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

About

Pre-trained vision-language models (VLMs) have shown remarkable performance in image and natural-language understanding tasks such as image captioning and response generation. As practical applications of VLMs become increasingly widespread, potential safety and robustness issues raise the concern that adversaries may evade the system and cause these models to generate toxic content through malicious attacks. Evaluating the robustness of open-source VLMs against adversarial attacks has therefore garnered growing attention, with transfer-based attacks as a representative black-box strategy. However, most existing transfer-based attacks neglect the semantic correlations between the vision and text modalities, leading to sub-optimal adversarial example generation and attack performance. To address this issue, we present Chain of Attack (CoA), which iteratively enhances adversarial example generation through multi-modal semantic updates over a series of intermediate attacking steps, achieving superior adversarial transferability and efficiency. We further propose a unified attack success rate (ASR) computation method for automatic evasion evaluation. Extensive experiments conducted under the most realistic and high-stakes scenario demonstrate that our strategy can effectively mislead models into generating targeted responses using only black-box attacks, without any knowledge of the victim models. The comprehensive robustness evaluation in this paper provides insight into the vulnerabilities of VLMs and offers a reference for safety considerations in future model development.
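The abstract describes an iterative, embedding-guided transfer attack: adversarial perturbations are refined over intermediate steps to push an image's representation toward a target semantic. Below is a minimal, self-contained sketch of that general idea using a toy linear "encoder" in place of a real VLM; the encoder, the numerical gradient, and the PGD-style update are illustrative assumptions, not the paper's actual CoA procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy surrogate "vision encoder" (stand-in for a real VLM's image encoder):
# a fixed random projection from a 16-dim "image" to an 8-dim embedding.
W_img = rng.normal(size=(8, 16))

def embed(x):
    v = W_img @ x
    return v / np.linalg.norm(v)  # unit-normalized embedding

x_clean = rng.normal(size=16)              # clean "image"
target_emb = embed(rng.normal(size=16))    # hypothetical target-text embedding

eps, alpha, steps = 0.5, 0.05, 50          # L_inf budget, step size, iterations
x_adv = x_clean.copy()
for _ in range(steps):
    # Central-difference estimate of the gradient of cosine similarity
    # between the adversarial image embedding and the target embedding.
    g = np.zeros_like(x_adv)
    h = 1e-4
    for i in range(len(x_adv)):
        d = np.zeros_like(x_adv)
        d[i] = h
        g[i] = (target_emb @ embed(x_adv + d)
                - target_emb @ embed(x_adv - d)) / (2 * h)
    x_adv = x_adv + alpha * np.sign(g)                    # signed ascent step
    x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)  # project into budget

sim_before = target_emb @ embed(x_clean)
sim_after = target_emb @ embed(x_adv)   # similarity typically increases
```

In a real transfer attack, the surrogate encoder would be a white-box model (e.g. a CLIP-style image encoder), gradients would come from autodiff rather than finite differences, and the resulting `x_adv` would then be handed to the black-box victim VLM.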

Peng Xie, Yequan Bie, Jianda Mao, Yangqiu Song, Yang Wang, Hao Chen, Kani Chen• 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Black-box Attack | VLM Evaluation Set (test) | Ensemble Success Rate | 69.12 | 96 |
| Adversarial Attack | LVLM Evaluation Set | ASR | 6 | 40 |
| Untargeted Adversarial Attack | ImageNet | ASR (Average) | 18.6 | 36 |
| Image Captioning Robustness | Image Captioning Dataset | CLIP Score (RN-50) | 82.9 | 30 |
| Untargeted Adversarial Attack | Flickr30K 1,000 images (test) | ASR | 51.4 | 30 |
| Untargeted Adversarial Attack | Flickr30K | ASR | 32.55 | 30 |
| Targeted Adversarial Attack | ImageNet | ASR (Average) | 0.3 | 30 |
| Image Captioning | MSCOCO (test) | -- | -- | 29 |
| Targeted Adversarial Attack | Flickr30K | ASR | 2.14 | 25 |
| Targeted Adversarial Attack | Flickr30K 1,000 images (test) | Attack Success Rate (ASR) | 1.92 | 25 |

Showing 10 of 20 rows
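Most rows above report an attack success rate (ASR): the percentage of attacked inputs for which the victim model produced the attacker's intended behavior. The paper proposes a unified, automatic ASR computation; as a simple illustrative stand-in (not the paper's method), a keyword-matching judge can be sketched like this:

```python
def attack_success_rate(responses, target_keywords):
    """Percentage of model responses containing any target keyword.

    A simple keyword-matching proxy for evasion evaluation; the paper's
    unified ASR computation is more involved than this sketch.
    """
    hits = sum(
        any(k.lower() in r.lower() for k in target_keywords)
        for r in responses
    )
    return 100.0 * hits / len(responses)

# Example: one of two responses matches the target, so ASR is 50.0.
asr = attack_success_rate(["a photo of a dog", "a cat on a mat"], ["dog"])
```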
