Attention Misses Visual Risk: Risk-Adaptive Steering for Multimodal Safety Alignment

About

Even modern AI models often remain vulnerable to multimodal queries in which harmful intent is embedded in images. A widely used approach for safety alignment is training with extensive multimodal safety datasets, but the costs of data curation and training are often prohibitive. To mitigate these costs, inference-time alignment methods have recently been explored, but they often lack generalizability across diverse multimodal jailbreaks and still incur notable overhead due to extra forward passes for response refinement or heavy pre-deployment calibration procedures. Here, we identify insufficient visual attention to safety-critical image regions as one of the key causes of multimodal safety failures. Building on this insight, we propose Multimodal Risk-Adaptive Steering (MoRAS), which enhances safety-critical visual attention via concise visual contexts for accurate multimodal risk assessment. This risk signal enables risk-adaptive steering for direct refusals, reducing inference overhead while remaining generalizable across diverse multimodal jailbreaks. Notably, MoRAS requires only a small calibration set to estimate multimodal risk, substantially reducing pre-deployment overhead. We conduct various empirical validations across multiple benchmarks and MLLM backbones, and observe that the proposed MoRAS consistently mitigates jailbreaks, preserves utility, and reduces computational overhead compared to state-of-the-art inference-time defenses.
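The abstract does not give implementation details, so the following is only a generic sketch of what risk-scaled steering could look like, not the authors' MoRAS procedure. It assumes a difference-of-means risk probe fitted on a small calibration set and a steering step that adds a refusal vector in proportion to the estimated risk. All names here (`fit_risk_probe`, `risk_score`, `steer`, `max_strength`, `temperature`) are hypothetical, and the visual-attention enhancement described in the abstract is not modeled.

```python
import math


def mean_vec(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fit_risk_probe(safe_feats, harmful_feats):
    """Fit a difference-of-means probe from a small calibration set.

    Returns a unit direction pointing from the safe mean to the harmful
    mean, plus the projection value halfway between the two means.
    This probe is an assumption for illustration only.
    """
    mu_safe = mean_vec(safe_feats)
    mu_harm = mean_vec(harmful_feats)
    diff = [h - s for h, s in zip(mu_harm, mu_safe)]
    norm = math.sqrt(dot(diff, diff))
    direction = [d / norm for d in diff]
    midpoint = 0.5 * (dot(mu_safe, direction) + dot(mu_harm, direction))
    return direction, midpoint


def risk_score(feat, direction, midpoint, temperature=1.0):
    """Map the projection onto the probe direction to a risk in (0, 1)."""
    z = (dot(feat, direction) - midpoint) / temperature
    return 1.0 / (1.0 + math.exp(-z))


def steer(hidden, refusal_vec, risk, max_strength=4.0):
    """Add the refusal vector to a hidden state, scaled by the risk.

    A risk near 0 leaves the hidden state almost unchanged; a risk
    near 1 applies close to the full steering strength.
    """
    scale = max_strength * risk
    return [h + scale * r for h, r in zip(hidden, refusal_vec)]


# Toy calibration set: 2-D features standing in for model activations.
safe = [[-1.0, 0.1], [-0.8, -0.2], [-1.2, 0.0]]
harmful = [[1.0, 0.0], [0.9, 0.3], [1.1, -0.1]]
direction, midpoint = fit_risk_probe(safe, harmful)

low_risk = risk_score([-1.0, 0.0], direction, midpoint)
high_risk = risk_score([1.0, 0.0], direction, midpoint)
```

In this sketch a benign input receives a small steering scale and a risky one a large scale, which is one way a single forward pass could yield a direct refusal without a separate response-refinement pass.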

Jonghyun Park, Minhyuk Seo, Chaewon Yeo, Jonghyun Choi • 2025

Related benchmarks

Task	Dataset	Result
Science Question Answering	ScienceQA	--	916
Multimodal Reasoning	MM-Vet	MM-Vet Score 50.8	551
Multimodal Evaluation	MM-Vet	Score 35.1	249
Visual Question Answering	GQA	Score 63.2	193
Multimodal Evaluation	MME	MME-P Score 1.62e+3	139
Over-refusal	XSTest	Overrefusal Rate 7.2	102
Safety Evaluation	MM-Safety	ASR 0.4	57
Safety Evaluation	FigStep	ASR 0.6	47
Safety Alignment	JOOD	ASR 0.00e+0	40
Safety Evaluation	SPA-VL	ASR 1.2	40

Showing 10 of 15 rows
