Attention Misses Visual Risk: Risk-Adaptive Steering for Multimodal Safety Alignment
About
Even modern AI models often remain vulnerable to multimodal queries in which harmful intent is embedded in images. A widely used approach to safety alignment is training on extensive multimodal safety datasets, but the costs of data curation and training are often prohibitive. To mitigate these costs, inference-time alignment methods have recently been explored, but they often lack generalizability across diverse multimodal jailbreaks and still incur notable overhead, due either to extra forward passes for response refinement or to heavy pre-deployment calibration procedures. Here, we identify insufficient visual attention to safety-critical image regions as one of the key causes of multimodal safety failures. Building on this insight, we propose Multimodal Risk-Adaptive Steering (MoRAS), which enhances safety-critical visual attention via concise visual contexts for accurate multimodal risk assessment. This risk signal enables risk-adaptive steering toward direct refusals, reducing inference overhead while remaining generalizable across diverse multimodal jailbreaks. Notably, MoRAS requires only a small calibration set to estimate multimodal risk, substantially reducing pre-deployment overhead. We validate MoRAS empirically across multiple benchmarks and MLLM backbones, and observe that it consistently mitigates jailbreaks, preserves utility, and reduces computational overhead compared to state-of-the-art inference-time defenses.
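To make the idea of risk-adaptive steering concrete, here is a minimal sketch of the general technique, not the MoRAS implementation: the abstract does not specify how the risk score is computed or how steering is applied. The sketch assumes a risk score obtained by projecting a feature vector onto the axis between the mean "safe" and mean "harmful" features of a small calibration set, and a steering step that shifts a hidden state along a refusal direction in proportion to that risk. The names `risk_score`, `steer`, `sharpness`, and `alpha` are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


def risk_score(feature, safe_mean, harmful_mean, sharpness=4.0):
    """Hypothetical risk estimator returning a value in (0, 1).

    Assumption: safe_mean and harmful_mean are feature averages taken
    over a small calibration set; this is an illustrative choice only.
    """
    axis = [h - s for h, s in zip(harmful_mean, safe_mean)]
    midpoint = [(h + s) / 2 for h, s in zip(harmful_mean, safe_mean)]
    offset = [f - m for f, m in zip(feature, midpoint)]
    # Signed position along the safe->harmful axis, relative to the midpoint.
    logit = dot(offset, axis) / (dot(axis, axis) + 1e-8)
    return 1.0 / (1.0 + math.exp(-sharpness * logit))


def steer(hidden, refusal_direction, risk, alpha=8.0):
    """Shift a hidden state along the unit refusal direction, scaled by risk.

    With risk near 0 the state is left almost unchanged; with risk near 1
    it moves by roughly alpha along the refusal direction.
    """
    norm = math.sqrt(dot(refusal_direction, refusal_direction)) + 1e-8
    return [h + alpha * risk * r / norm for h, r in zip(hidden, refusal_direction)]


# Usage with toy two-dimensional vectors (illustrative values only).
safe_mean, harmful_mean = [0.0, 0.0], [2.0, 0.0]
risk = risk_score([1.8, 0.3], safe_mean, harmful_mean)
steered = steer([1.0, 1.0], [0.0, 3.0], risk)
```

Because the steering strength scales with the estimated risk, benign inputs pass through nearly untouched, which is one plausible way a single forward pass could both refuse harmful queries and preserve utility on harmless ones.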
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Science Question Answering | ScienceQA | -- | -- | 502 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 50.8 | 431 |
| Visual Question Answering | GQA | Score | 63.2 | 193 |
| Multimodal Evaluation | MM-Vet | Score | 35.1 | 180 |
| Over-refusal | XSTest | Overrefusal Rate | 7.2 | 78 |
| Multimodal Evaluation | MME | MME-P Score | 1.62e+3 | 73 |
| Safety Evaluation | MM-Safety | ASR | 0.4 | 57 |
| Safety Alignment | JOOD | ASR | 0.00 | 40 |
| Safety Evaluation | SPA-VL | ASR | 1.2 | 40 |
| Safety Alignment | Visual Adversarial Attacks | ASR | 14.3 | 40 |