Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens

About

Vision-language models (VLMs) seamlessly integrate visual and textual data to perform tasks such as image classification, caption generation, and visual question answering. However, in cross-prompt transfer attacks, adversarial images often fail to deceive the model under all prompts, because the token probability distribution induced by these images tends to favor the semantics of the original image rather than the target tokens. To address this challenge, we propose a Contextual-Injection Attack (CIA) that employs gradient-based perturbation to inject target tokens into both the visual and textual contexts, thereby raising the probability of the target tokens. By shifting the contextual semantics towards the target tokens instead of the original image semantics, CIA enhances the cross-prompt transferability of adversarial images. Extensive experiments on the BLIP2, InstructBLIP, and LLaVA models show that CIA outperforms existing methods in cross-prompt transferability, demonstrating its potential for more effective adversarial strategies in VLMs.
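The abstract's core idea, a gradient-based perturbation that raises the target token's probability across several prompts at once, can be illustrated with a minimal sketch. This is not the authors' implementation of CIA: the toy linear "model" (`W`), the per-prompt bias vectors standing in for different text prompts, the signed-gradient update, and every name and constant below are assumptions made only for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_probs(W, x, bias):
    # Toy stand-in for a VLM: logits are linear in the image features x,
    # and each prompt contributes a different additive bias (assumption).
    logits = [sum(w * xj for w, xj in zip(row, x)) + b for row, b in zip(W, bias)]
    return softmax(logits)

def grad_log_target(W, x, bias, target):
    # d log p[target] / d x[j] = W[target][j] - sum_k p[k] * W[k][j]
    p = token_probs(W, x, bias)
    return [W[target][j] - sum(p[k] * W[k][j] for k in range(len(W)))
            for j in range(len(x))]

def mean_target_prob(W, x, biases, target):
    return sum(token_probs(W, x, b)[target] for b in biases) / len(biases)

def attack(W, x_clean, biases, target, eps=1.0, step=0.1, iters=50):
    # Signed-gradient ascent on the summed log-probability of the target
    # token over all prompts, projected onto an L-infinity ball of radius eps.
    x = list(x_clean)
    for _ in range(iters):
        grad = [0.0] * len(x)
        for b in biases:
            g = grad_log_target(W, x, b, target)
            grad = [a + c for a, c in zip(grad, g)]
        x = [xj + step * ((gj > 0) - (gj < 0)) for xj, gj in zip(x, grad)]
        x = [min(max(xj, cj - eps), cj + eps) for xj, cj in zip(x, x_clean)]
    return x

# Hypothetical 3-token vocabulary; token 0 matches the clean image, token 2 is the target.
W = [[1.0, 0.5, -0.5, 0.0],
     [-0.5, 1.0, 0.5, -1.0],
     [0.0, -1.0, 1.0, 0.5]]
x_clean = [1.0, 0.5, -0.5, 0.0]
prompt_biases = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.2, 0.2, 0.0]]
target = 2

before = mean_target_prob(W, x_clean, prompt_biases, target)
x_adv = attack(W, x_clean, prompt_biases, target)
after = mean_target_prob(W, x_adv, prompt_biases, target)
print(f"mean target probability: {before:.3f} -> {after:.3f}")
```

Summing the gradient over all prompts before each step is what pushes the perturbation towards one that works regardless of the prompt. A real attack would instead backpropagate through the VLM itself and constrain the perturbation in pixel space.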

Xikang Yang, Xuehai Tang, Fuqing Zhu, Jizhong Han, Songlin Hu • 2024

Related benchmarks

Task	Dataset	Metric	Result
Captioning	Open Flamingo	Targeted ASR	50.8	4
Classification	Open Flamingo	Targeted ASR	51.12	4
Image Captioning	BLIP-2 evaluation suite	Targeted ASR	46.87	4
Image Classification	BLIP-2 evaluation suite	Targeted ASR	48.57	4
Targeted Adversarial Attack	Blip2 evaluation suite Target: 'Bomb' (test)	VQA General Performance	34.31	4
Vision-Language Tasks (Overall)	BLIP-2 evaluation suite	Targeted ASR	37	4
Visual Question Answering (general)	BLIP-2 evaluation suite	Targeted ASR	29.85	4
Visual Question Answering (specific)	BLIP-2 evaluation suite	Targeted ASR	22.81	4
VQA general	Open Flamingo	Targeted ASR	30.27	4
VQA specific	Open Flamingo	Targeted ASR	43.02	4
