
Understanding and Rectifying Safety Perception Distortion in VLMs

About

Recent studies reveal that vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality, exhibiting greater vulnerability than their text-only LLM backbones. To uncover the root cause of this phenomenon, we conduct an in-depth analysis and identify a key issue: multimodal inputs introduce a modality-induced activation shift toward a "safer" direction compared to their text-only counterparts, leading VLMs to systematically overestimate the safety of harmful inputs. We refer to this issue as safety perception distortion. To mitigate such distortion, we propose Activation Shift Disentanglement and Calibration (ShiftDC), a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety. By isolating and removing the safety-relevant component, ShiftDC restores the inherent safety alignment of the LLM backbone while preserving the vision-language capabilities of VLMs. Empirical results demonstrate that ShiftDC significantly enhances alignment performance on safety benchmarks without impairing model utility.
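The core calibration step can be illustrated with a minimal sketch. The following is an assumption-laden illustration, not the paper's implementation: it assumes that per-layer activations for the multimodal and text-only versions of an input are available as vectors, and that a "safety direction" in activation space has been estimated beforehand (e.g., from contrasting safe and unsafe prompts). The function names and signatures are hypothetical.

```python
import numpy as np

def calibrate_activation(h_multimodal, h_text_only, safety_direction):
    """Remove the safety-relevant component of the modality-induced
    activation shift, keeping the rest of the multimodal activation.

    All arguments are 1-D numpy arrays of the same hidden dimension.
    `safety_direction` is an assumed precomputed direction in
    activation space along which safety perception varies.
    """
    # Shift introduced by adding the vision modality at this layer
    shift = h_multimodal - h_text_only
    # Unit vector along the safety direction
    u = safety_direction / np.linalg.norm(safety_direction)
    # Safety-relevant component of the shift (projection onto u)
    safety_component = np.dot(shift, u) * u
    # Subtract only that component, preserving the modality
    # information carried by the orthogonal remainder
    return h_multimodal - safety_component
```

For example, with a shift of [1, 1] and a safety direction along the first axis, only the first coordinate of the shift is removed, while the orthogonal part of the modality shift is retained.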

Xiaohan Zou, Jian Kang, George Kesidis, Lu Lin • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Jailbreak Attack | HADES | Attack Success Rate | 65.3 | 59 |
| Safety Evaluation | MM-Safety | ASR | 10.31 | 57 |
| Jailbreak Attack Defense | MM-SafetyBench | Attack Success Rate (ASR) | 28.1 | 56 |
| Jailbreak Attack | RedTeam 2K | ASR | 33 | 52 |
| Vision-Language Understanding | MM-Vet | -- | -- | 43 |
| Safety Evaluation | MM-SafetyBench | Average ASR | 2.24 | 42 |
| Jailbreak Defense | HADES | ASR | 47.3 | 24 |
| Safety Evaluation | JailBreakV | ASR | 11.17 | 18 |
| Safety Evaluation | JailbreakV-28K v1 (test) | ASR (Noise-T) | 7.72 | 18 |
| Jailbreak Attack Success Evaluation | RedTeam2K SD+TYPO | Attack Success Rate (ASR) | 42.5 | 18 |
Showing 10 of 23 rows
