MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
About
Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings. MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings of the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo. Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods and maintains its utility even under strong adaptive adversarial conditions.
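The detection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `mirror_check`, the parameters `alpha` and `threshold`, the use of cosine similarity, and the choice to draw one Gaussian noise vector per query and add it to both embeddings are all assumptions made for the example. The captioner, T2I generators, and encoders are passed in as plain callables so that any real models could be substituted.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mirror_check(
    image: Sequence[float],
    caption_fn: Callable[[Sequence[float]], str],            # target VLM (hypothetical stand-in)
    generators: Sequence[Callable[[str], Sequence[float]]],  # T2I model zoo (hypothetical)
    encoders: Sequence[Callable[[Sequence[float]], Vector]], # image encoder zoo (hypothetical)
    alpha: float = 0.05,      # assumed name for the OTU scaling factor
    threshold: float = 0.7,   # assumed decision threshold, not taken from the paper
    rng: Optional[random.Random] = None,
) -> Tuple[float, bool]:
    """Return (similarity score, flagged_as_adversarial)."""
    rng = rng or random.Random()

    # 1. Caption the input with the target model.
    caption = caption_fn(image)

    # 2. Stochastic defense: pick a generator and an encoder at random.
    generate = rng.choice(list(generators))
    encode = rng.choice(list(encoders))

    # 3. Regenerate an image from the caption and embed both images.
    regenerated = generate(caption)
    emb_input = list(encode(image))
    emb_regen = list(encode(regenerated))

    # 4. One-time-use perturbation (assumed form): fresh noise per query,
    #    scaled by alpha and added to both embeddings.
    noise = [rng.gauss(0.0, 1.0) for _ in emb_input]
    emb_input = [e + alpha * n for e, n in zip(emb_input, noise)]
    emb_regen = [e + alpha * n for e, n in zip(emb_regen, noise)]

    # 5. Low semantic consistency suggests the caption was steered by an attack.
    score = cosine_similarity(emb_input, emb_regen)
    return score, score < threshold
```

In this sketch a clean input whose caption faithfully describes it should regenerate to a nearby embedding and score above the threshold, while an input whose caption has been pushed toward an unrelated target should score low and be flagged. Because the generator, encoder, and noise are resampled on every call, an adaptive attacker cannot optimize against one fixed pipeline.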
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Adversarial Detection | ImageNet BLIP-2 | Detection Rate | 99 | 33 |
| Adversarial Detection | ImageNet BLIP | Detection Rate | 97 | 24 |
| Adversarial Detection | ImageNet Img2Prompt | Detection Rate | 92 | 23 |
| Text-based Jailbreak | AdvBench-M OOD | ASR (OOD) | 30.15 | 16 |
| Direct Malicious | VLSafe OOD | ASR | 26.33 | 16 |
| Text-based Jailbreak | JailbreakV_28K IND | Attack Success Rate (ASR) | 20.65 | 16 |
| Image-based Jailbreak | JailbreakV_28K IND | ASR | 17.19 | 16 |
| Image-based Jailbreak | FigStep OOD | ASR | 15.36 | 16 |
| Image-based Jailbreak | HADES OOD | Attack Success Rate (ASR) | 23.09 | 16 |
| Malicious Prompt Detection | JailbreakV_28K Image-based (test) | FNR | 17.19 | 16 |