Grounding-Driven Attack: Improving Encoder-based Adversarial Transferability against Large Vision-Language Models
About
Large vision-language models (LVLMs) have achieved impressive performance across multimodal tasks, but their reliance on visual inputs exposes them to adversarial threats. Encoder-based attacks provide an efficient alternative to end-to-end optimization by crafting perturbations through the vision encoder alone. However, existing encoder-based attacks often assume that the surrogate encoder is identical or similar to the victim LVLM's vision encoder. In this work, we present a systematic study of their transferability in more realistic black-box deployments with heterogeneous LVLM architectures. We find that model-specific visual evidence is inconsistent across models, whereas text-conditioned grounding regions are more closely tied to caption-relevant evidence and provide a more stable transfer target. Yet existing attacks remain weakly aligned with these regions and disrupt them insufficiently. Motivated by these findings, we propose Grounding-Driven Attack (GDA), which aligns perturbation optimization with text-grounded evidence. GDA combines Grounding-Aware Perturbation Allocation, which concentrates the perturbation budget on grounded evidence regions, with Grounding-Centric Evidence Disruption, which intensifies their global and local disruption. Experiments across diverse victim models and tasks show that GDA consistently outperforms existing encoder-based attacks in black-box transfer. These results highlight the central role of text-grounded evidence in adversarial transferability and motivate grounding-aware robustness evaluation and defense design.
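To make the idea of concentrating the perturbation budget on grounded regions concrete, here is a minimal NumPy sketch. It is not the paper's algorithm: the abstract does not specify how the budget is allocated, so the per-pixel L-infinity budget map, the `floor` parameter, the function names `allocate_budget` and `masked_sign_step`, and the random stand-in gradient are all assumptions made for illustration. In a real attack the mask would presumably come from a text-conditioned grounding model and the gradient from a surrogate vision encoder loss.

```python
import numpy as np

def allocate_budget(mask, eps, floor=0.25):
    """Hypothetical per-pixel L-inf budget.

    Pixels inside the grounded region (mask == 1) get the full eps;
    pixels outside get floor * eps. `floor` is an assumed parameter.
    """
    mask = np.clip(mask.astype(np.float64), 0.0, 1.0)
    return eps * (floor + (1.0 - floor) * mask)

def masked_sign_step(x_adv, x_clean, grad, eps_map, step):
    """One signed-gradient step, projected onto the per-pixel budget."""
    x_new = x_adv + step * np.sign(grad)
    # Project back into the box [x_clean - eps_map, x_clean + eps_map].
    x_new = np.clip(x_new, x_clean - eps_map, x_clean + eps_map)
    # Keep the image in the valid pixel range.
    return np.clip(x_new, 0.0, 1.0)

# Toy usage on an 8x8 single-channel "image".
rng = np.random.default_rng(0)
x_clean = rng.uniform(0.2, 0.8, size=(8, 8))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0  # assumed grounded region
eps_map = allocate_budget(mask, eps=8 / 255)

x_adv = x_clean.copy()
for _ in range(10):
    # Stand-in for the gradient of an encoder-level loss (assumption).
    grad = rng.standard_normal(x_clean.shape)
    x_adv = masked_sign_step(x_adv, x_clean, grad, eps_map, step=2 / 255)

delta = np.abs(x_adv - x_clean)
```

In this sketch the perturbation never exceeds the local budget, so pixels outside the assumed grounded region can change by at most a quarter of what grounded pixels can. A soft mask with values between 0 and 1 would interpolate between the two budgets.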
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Adversarial Attack | LVLM Evaluation Set | ASR | 64 | 40 |
| Adversarial Attack | GPT-4o | ASR | 16.6 | 14 |
| Targeted Adversarial Attack | GPT-4o | ASR | 860 | 12 |
| Adversarial Attack | Gemini 2.0 | ASR | 13.2 | 11 |
| Adversarial Attack Imperceptibility | Adversarial Attack (Evaluation Set) | SSIM | 0.9161 | 9 |
| Image Classification | CIFAR-10 (test) | CIFAR-10 Classification Score | 99.6 | 9 |
| Image Classification | CIFAR-10 BLIP-2 | CLIP Similarity (RN-50) | 0.2256 | 9 |
| Adversarial Attack | llava | CLIP Similarity (RN-50) | 0.2282 | 9 |
| Adversarial Attack | Qwen VL 2.5 | CLIP Similarity (RN-50) | 0.2481 | 9 |
| Image Classification | CIFAR-10 InternVL3 | CLIP Similarity (RN-50) | 0.2474 | 9 |