VLMShield: Efficient and Robust Defense of Vision-Language Models against Malicious Prompts

About

Vision-Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks due to weakened alignment during visual integration. Existing defenses suffer from limited efficiency and robustness. To address these challenges, we first propose the Multimodal Aggregated Feature Extraction (MAFE) framework, which enables CLIP to handle long text and fuse multimodal information into unified representations. Through empirical analysis of MAFE-extracted features, we discover distinct distributional patterns between benign and malicious prompts. Building upon this finding, we develop VLMShield, a lightweight safety detector that efficiently identifies multimodal malicious attacks as a plug-and-play solution. Extensive experiments demonstrate superior performance across multiple dimensions, including robustness, efficiency, and utility. Through our work, we hope to pave the way for more secure multimodal AI deployment. Code is available at [https://github.com/pgqihere/VLMShield](https://github.com/pgqihere/VLMShield).
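The abstract describes MAFE as letting CLIP handle long text and fuse modalities into a unified representation. The paper's exact mechanism is not given here; the sketch below shows one plausible reading (chunking a long prompt to fit CLIP's 77-token text window, mean-pooling the chunk embeddings, then concatenating with the image embedding). The function names, the chunk-and-pool strategy, and concatenation fusion are all illustrative assumptions, with a random-projection stand-in for the real CLIP encoder.

```python
import numpy as np

EMBED_DIM = 512   # embedding size of CLIP ViT-B/32 (illustrative choice)
MAX_TOKENS = 77   # CLIP's text-encoder context limit

def embed_text_chunk(chunk: str) -> np.ndarray:
    # Stand-in for CLIP's text encoder: a deterministic unit-norm vector.
    rng = np.random.default_rng(abs(hash(chunk)) % (2**32))
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def mafe_features(text_tokens: list[str], image_embedding: np.ndarray) -> np.ndarray:
    # Split a long prompt into windows that fit CLIP's 77-token limit,
    # embed each window, and mean-pool into one aggregated text vector.
    chunks = [text_tokens[i:i + MAX_TOKENS]
              for i in range(0, len(text_tokens), MAX_TOKENS)]
    text_vecs = np.stack([embed_text_chunk(" ".join(c)) for c in chunks])
    text_agg = text_vecs.mean(axis=0)
    # Fuse modalities by concatenating text and image features
    # into a single unified representation.
    return np.concatenate([text_agg, image_embedding])

tokens = [f"tok{i}" for i in range(200)]   # a prompt longer than one CLIP window
image_emb = np.zeros(EMBED_DIM)            # placeholder image embedding
feat = mafe_features(tokens, image_emb)
print(feat.shape)                          # (1024,)
```

A lightweight detector in the VLMShield spirit would then be a small classifier (e.g. logistic regression) trained on such fused features, exploiting the distributional gap between benign and malicious prompts.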

Peigui Qi, Kunsheng Tang, Yanpu Yu, Jialin Wu, Yide Song, Wenbo Zhou, Zhicong Huang, Cheng Hong, Weiming Zhang, Nenghai Yu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Direct Malicious | MM-SafetyBench OOD | ASR | 0.71 | 16 |
| Image-based Jailbreak | JailbreakV_28K IND | ASR | 0.19 | 16 |
| Image-based Jailbreak | FigStep OOD | ASR | 0.00 | 16 |
| Malicious Prompt Detection | JailbreakV_28K Image-based (test) | FNR | 0.19 | 16 |
| Malicious Prompt Detection | JailbreakV_28K Text-based (test) | FNR | 0.00 | 16 |
| Direct Malicious | VLSafe OOD | ASR | 1.62 | 16 |
| Image-based Jailbreak | HADES OOD | ASR | 2.13 | 16 |
| Text-based Jailbreak | JailbreakV_28K IND | ASR | 0.00 | 16 |
| Text-based Jailbreak | AdvBench-M OOD | ASR (OOD) | 0.41 | 16 |
| Computational Efficiency | Malicious Prompt Detection Benchmarks | Detection Time (s) | 0.34 | 14 |

Showing 10 of 18 rows.
