
Attention Misses Visual Risk: Risk-Adaptive Steering for Multimodal Safety Alignment

About

Even modern AI models often remain vulnerable to multimodal queries in which harmful intent is embedded in images. A widely used approach to safety alignment is training on extensive multimodal safety datasets, but the costs of data curation and training are often prohibitive. To mitigate these costs, inference-time alignment methods have recently been explored, but they often generalize poorly across diverse multimodal jailbreaks and still incur notable overhead from extra forward passes for response refinement or from heavy pre-deployment calibration procedures. Here, we identify insufficient visual attention to safety-critical image regions as one of the key causes of multimodal safety failures. Building on this insight, we propose Multimodal Risk-Adaptive Steering (MoRAS), which enhances safety-critical visual attention via concise visual contexts for accurate multimodal risk assessment. The resulting risk signal enables risk-adaptive steering toward direct refusals, reducing inference overhead while remaining generalizable across diverse multimodal jailbreaks. Notably, MoRAS requires only a small calibration set to estimate multimodal risk, substantially reducing pre-deployment overhead. Experiments across multiple benchmarks and multimodal large language model (MLLM) backbones show that MoRAS consistently mitigates jailbreaks, preserves utility, and reduces computational overhead compared to state-of-the-art inference-time defenses.
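
To make the idea of risk-adaptive steering concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how MoRAS computes its risk signal or applies steering, so everything here is an assumption for illustration: the risk score is taken from a difference-of-means direction fitted on a small calibration set, and steering adds a refusal direction to a hidden state with a strength proportional to that score. The names `calibrate`, `risk_score`, `steer`, `sharpness`, and `max_strength` are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def calibrate(harmful, benign):
    """Fit a risk direction and bias from a small calibration set.

    Assumption: risk is linearly separable along the difference of the
    mean hidden states of harmful and benign examples.
    """
    mu_h, mu_b = mean_vector(harmful), mean_vector(benign)
    direction = [a - b for a, b in zip(mu_h, mu_b)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("harmful and benign means coincide")
    direction = [d / norm for d in direction]
    # Place the decision boundary at the midpoint between the two means.
    midpoint = [(a + b) / 2 for a, b in zip(mu_h, mu_b)]
    bias = sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def risk_score(hidden, direction, bias, sharpness=4.0):
    """Map a hidden state to a risk value in (0, 1) with a sigmoid."""
    projection = sum(h * d for h, d in zip(hidden, direction))
    return 1.0 / (1.0 + math.exp(-sharpness * (projection - bias)))


def steer(hidden, refusal_direction, risk, max_strength=2.0):
    """Shift the hidden state toward refusal, scaled by the risk value."""
    return [h + max_strength * risk * r
            for h, r in zip(hidden, refusal_direction)]


# Toy two-dimensional calibration data (hypothetical values).
harmful = [[1.0, 0.0], [0.8, 0.2]]
benign = [[-1.0, 0.0], [-0.8, -0.2]]
direction, bias = calibrate(harmful, benign)

high = risk_score([1.0, 0.0], direction, bias)   # close to 1
low = risk_score([-1.0, 0.0], direction, bias)   # close to 0
steered = steer([0.0, 0.0], [1.0, 0.0], risk=1.0)
```

In this sketch a low-risk input is left almost unchanged because the steering term is multiplied by a risk value near zero, which is one way a single forward pass could refuse harmful queries without degrading benign ones.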

Jonghyun Park, Minhyuk Seo, Chaewon Yeo, Jonghyun Choi • 2025

Related benchmarks

Task                        Dataset                     Result                Rank
Science Question Answering  ScienceQA                   --                    502
Multimodal Reasoning        MM-Vet                      MM-Vet Score 50.8     431
Visual Question Answering   GQA                         Score 63.2            193
Multimodal Evaluation       MM-Vet                      Score 35.1            180
Over-refusal                XSTest                      Overrefusal Rate 7.2  78
Multimodal Evaluation       MME                         MME-P Score 1.62e+3   73
Safety Evaluation           MM-Safety                   ASR 0.4               57
Safety Alignment            JOOD                        ASR 0.00e+0           40
Safety Evaluation           SPA-VL                      ASR 1.2               40
Safety Alignment            Visual Adversarial Attacks  ASR 14.3              40
Showing 10 of 15 rows
