MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

About

Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings. MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings between the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo. Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods, and maintains its utility even under strong adaptive adversarial conditions.
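The consistency check described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy encoders are hypothetical random projections standing in for pretrained image encoders, the caption and T2I regeneration steps are assumed to have already produced `regenerated`, and the Gaussian form of the one-time-use perturbation, the scaling factor `alpha`, and the threshold `tau` are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def make_toy_encoder(seed, in_dim=64, out_dim=16):
    """Hypothetical stand-in for a pretrained image encoder:
    a fixed random linear projection of the flattened image."""
    w = np.random.default_rng(seed).standard_normal((in_dim, out_dim))
    return lambda image: np.asarray(image, dtype=float).ravel() @ w


def mirror_check_score(image, regenerated, encoders, alpha, rng):
    """Similarity between the input image and the image regenerated
    from the target model's caption, using a randomly chosen encoder."""
    # Stochastic defense: pick one encoder from the zoo per query.
    encoder = encoders[rng.integers(len(encoders))]
    z_orig = encoder(image)
    z_regen = encoder(regenerated)
    # One-time-use perturbation: fresh noise per query, scaled by alpha.
    # Assumption: Gaussian noise, with the same draw added to both embeddings.
    noise = alpha * rng.standard_normal(z_orig.shape)
    return cosine_similarity(z_orig + noise, z_regen + noise)


def is_adversarial(score, tau):
    """Flag the input when the two images are semantically inconsistent."""
    return score < tau


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    zoo = [make_toy_encoder(seed) for seed in (1, 2, 3)]
    clean = np.random.default_rng(42).standard_normal((8, 8))
    score = mirror_check_score(clean, clean, zoo, alpha=0.1, rng=rng)
    print(score, is_adversarial(score, tau=0.5))
```

In a full pipeline, `regenerated` would come from a T2I generator that is also drawn at random from a model zoo, so an adaptive attacker cannot know in advance which generator and encoder pair will score the input.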

Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Takáč, Pascal Fua, Ivan Laptev, Karthik Nandakumar • 2024

Related benchmarks

Task	Dataset	Result
Adversarial Detection	ImageNet BLIP-2	Detection Rate: 99	33
Malicious Prompt Detection	JailBreakV & GPT4V π = 0.005	AUROC: 66.4	26
Malicious Prompt Detection	VLGuard & MSSBench π = 0.005	AUROC (VLGuard & MSSBench): 0.5268	26
Malicious Prompt Detection	VLGuard & MLLMGuard π = 0.005	AUROC: 52.68	26
Adversarial Detection	ImageNet BLIP	Detection Rate: 97	24
Adversarial Detection	ImageNet Img2Prompt	Detection Rate: 92	23
Jailbreak Detection	MM-SafetyBench	AUROC: 78.1	23
Text-based Jailbreak	AdvBench-M OOD	ASR (OOD): 30.15	16
Direct Malicious	VLSafe OOD	ASR: 26.33	16
Text-based Jailbreak	JailbreakV_28K IND	Attack Success Rate (ASR): 20.65	16

Showing 10 of 37 rows
