Localization then Neutralization: Gradient-guided Token Suppression against Visual Prompt Injection Attack

About

Adversarial images pose a severe security threat to multimodal large language models through prompt injection. Existing defenses largely lack a principled understanding of the underlying mechanisms and struggle to balance efficiency and defense utility. In this work, we show that successful adversarial attacks do not rely on the entire image uniformly but instead depend on a small subset of critical image tokens. Based on this insight, we propose Gradient Token Masking (GTM), which localizes these tokens via gradient analysis and neutralizes them through masking. We find that attribution based on the first generated token's output probability fails when attacks preserve the predicted token. To overcome this, GTM utilizes the Hidden-State Gradient Norm score for generation-influence attribution under adversarial inputs. We prove that its ranking is consistent with that of the full adversarial loss gradient, providing a theoretical guarantee for accurate localization. Our method requires only a single forward-backward pass to identify and zero out a small number of high-scoring tokens, effectively disrupting the adversarial attack path. Extensive experiments on prompt injection and multimodal jailbreak attacks demonstrate that our approach reduces attack success rates (ASR) to near zero while preserving model utility with negligible computational overhead.
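As a rough illustration of the localize-then-neutralize step described above, the sketch below scores each image token by the L2 norm of a gradient vector, picks the highest-scoring tokens, and zeroes them out. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `gradient_norm_scores` and `mask_top_k`, the toy embeddings, and the toy gradients are all hypothetical, and the gradients are supplied by hand rather than computed by the single forward-backward pass through the model that the abstract describes. The exact definition of the Hidden-State Gradient Norm score is not given here, so a plain per-token gradient norm stands in for it.

```python
import math


def gradient_norm_scores(token_grads):
    """Return one score per token: the L2 norm of that token's gradient vector.

    Assumption: token_grads[i] is the gradient of some scalar attribution
    target with respect to the embedding of image token i.
    """
    return [math.sqrt(sum(g * g for g in grad)) for grad in token_grads]


def mask_top_k(tokens, token_grads, k):
    """Zero out the k tokens with the largest gradient-norm scores.

    Returns (masked_tokens, masked_indices). The inputs are not modified.
    """
    scores = gradient_norm_scores(token_grads)
    # Rank token indices by descending score; ties keep the lower index first.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    masked_indices = sorted(ranked[:k])
    masked = [list(tok) for tok in tokens]
    for i in masked_indices:
        masked[i] = [0.0] * len(masked[i])
    return masked, masked_indices


# Toy example: 4 image tokens with 2-dimensional embeddings and
# hand-written (hypothetical) gradients.
tokens = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.9], [1.1, 1.2]]
grads = [[0.01, 0.02], [3.0, 4.0], [0.1, 0.0], [0.6, 0.8]]
masked_tokens, masked_idx = mask_top_k(tokens, grads, k=2)
```

In this toy input, tokens 1 and 3 carry the largest gradient norms, so they are the two that get zeroed while the remaining tokens pass through unchanged. In a real pipeline the value of `k` and the layer at which gradients are taken would be design choices that this sketch does not attempt to settle.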

Dongpeng Zhang, Ke Ma, Yangbangyan Jiang, Gaozheng Pei, Longtao Huang, Qianqian Xu, Qingming Huang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Prompt Injection Defense | ImgHijack Leak Context | Attack Success Rate | 0.00e+0 | 48 |
| Visual Prompt Injection Defense | ImgHijack Specific String (test) | Attack Success Rate | 0.00e+0 | 32 |
| Jailbreak Defense | jailbreak defense dataset | ASR | 0.00e+0 | 24 |
| Visual Prompt Injection Defense | ImgHijack Specific String | Attack Success Rate | 0.00e+0 | 16 |
| Prompt Injection Defense | Jailbreak APGD | ASR | 1 | 12 |
| Defense against Visual Prompt Injection Attack | VMA Attacks | Manipulation Score | 0.00e+0 | 12 |
| Prompt Injection Defense | Jailbreak MI-FGSM | Attack Success Rate | 4 | 12 |
| Visual Prompt Injection Defense | Visual Prompt Injection | Inference Time (s) | 3.5 | 7 |
| Multimodal Jailbreak Defense | BAP | Attack Success Rate (ASR) | 20.91 | 6 |
| Multimodal Jailbreak Defense | UMK | Attack Success Rate | 24.55 | 6 |
Showing 10 of 11 rows
