VLMShield: Efficient and Robust Defense of Vision-Language Models against Malicious Prompts
About
Vision-Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks because safety alignment weakens when visual inputs are integrated. Existing defenses fall short in both efficiency and robustness. To address these challenges, we first propose the Multimodal Aggregated Feature Extraction (MAFE) framework, which enables CLIP to handle long text and fuse multimodal information into unified representations. Through empirical analysis of MAFE-extracted features, we find distinct distributional patterns between benign and malicious prompts. Building on this finding, we develop VLMShield, a lightweight, plug-and-play safety detector that efficiently identifies multimodal malicious attacks. Extensive experiments demonstrate superior performance across multiple dimensions, including robustness, efficiency, and utility. We hope this work paves the way for more secure multimodal AI deployment. Code is available at [https://github.com/pgqihere/VLMShield](https://github.com/pgqihere/VLMShield).
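To make the pipeline concrete, here is a minimal sketch of the overall idea: mean-pool CLIP embeddings of text chunks (so prompts longer than CLIP's token limit reduce to one vector), concatenate them with the image embedding, and score the fused feature with a lightweight linear probe. All names (`aggregate_text_features`, `fuse`, `LinearShield`) and the random stand-in embeddings are hypothetical illustrations, not the paper's actual MAFE/VLMShield implementation; the real detector is trained on labeled benign/malicious prompts.

```python
import numpy as np

def aggregate_text_features(chunk_embs: np.ndarray) -> np.ndarray:
    """Mean-pool embeddings of text chunks into one vector.

    Hypothetical aggregation step: long prompts are split into chunks
    that fit CLIP's text encoder, then pooled (MAFE's fusion may differ).
    """
    return chunk_embs.mean(axis=0)

def fuse(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Concatenate L2-normalized image and text embeddings into a
    single unified multimodal feature vector."""
    def norm(v):
        return v / (np.linalg.norm(v) + 1e-8)
    return np.concatenate([norm(image_emb), norm(text_emb)])

class LinearShield:
    """Illustrative lightweight detector: a linear probe with a sigmoid.

    Stand-in for VLMShield; the weights here are random and untrained.
    """
    def __init__(self, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=dim)
        self.b = 0.0

    def score(self, feat: np.ndarray) -> float:
        # Probability-like score that the prompt is malicious.
        return float(1.0 / (1.0 + np.exp(-(feat @ self.w + self.b))))

# Toy usage with random stand-ins for CLIP embeddings (dim 512).
rng = np.random.default_rng(42)
text_chunks = rng.normal(size=(3, 512))  # 3 chunks of a long prompt
image_emb = rng.normal(size=512)
feat = fuse(image_emb, aggregate_text_features(text_chunks))
prob_malicious = LinearShield(dim=feat.shape[0]).score(feat)
```

Because the detector only runs CLIP encoders plus a small probe (rather than querying the VLM itself), inference stays cheap, which matches the sub-second detection times reported in the efficiency row of the table below.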
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Direct Malicious | MM-SafetyBench (OOD) | ASR | 0.71 | 16 |
| Image-based Jailbreak | JailbreakV_28K (IND) | ASR | 0.19 | 16 |
| Image-based Jailbreak | FigStep (OOD) | ASR | 0.00 | 16 |
| Malicious Prompt Detection | JailbreakV_28K Image-based (test) | FNR | 0.19 | 16 |
| Malicious Prompt Detection | JailbreakV_28K Text-based (test) | FNR | 0.00 | 16 |
| Direct Malicious | VLSafe (OOD) | ASR | 1.62 | 16 |
| Image-based Jailbreak | HADES (OOD) | ASR | 2.13 | 16 |
| Text-based Jailbreak | JailbreakV_28K (IND) | ASR | 0.00 | 16 |
| Text-based Jailbreak | AdvBench-M (OOD) | ASR | 0.41 | 16 |
| Computational Efficiency | Malicious Prompt Detection Benchmarks | Detection Time (s) | 0.34 | 14 |