
PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality

About

Safeguarding vision-language models (VLMs) is a critical challenge: existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning. To this end, we introduce PRISM (Principled Reasoning for Integrated Safety in Multimodality), a System 2-like framework that aligns VLMs through a structured four-stage reasoning process explicitly designed to handle three distinct categories of multimodal safety violations. Our framework consists of two key components: a structured reasoning pipeline that analyzes each violation category in dedicated stages, and PRISM-DPO, a preference dataset generated via Monte Carlo Tree Search (MCTS) to refine reasoning quality through Direct Preference Optimization. Comprehensive evaluations show that PRISM substantially reduces attack success rates on JailbreakV-28K and VLBreak, improves robustness against adaptive attacks, and generalizes to out-of-distribution multi-image threats, while better preserving model utility on benign multimodal benchmarks. Our code, data, and model weights are available at https://github.com/SaFoLab-WISC/PRISM.
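The PRISM-DPO component applies the standard Direct Preference Optimization objective to preference pairs of reasoning traces (with chosen/rejected pairs derived from MCTS rollouts). A minimal sketch of the DPO loss for a single pair is shown below; the function name and scalar log-probability inputs are illustrative assumptions, not taken from the paper's code:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (illustrative sketch).

    In a PRISM-DPO-style setup, `logp_chosen`/`logp_rejected` would be the
    policy model's log-likelihoods of the preferred and dispreferred
    reasoning traces, and the `ref_*` values the frozen reference model's.
    """
    # Implicit reward = log-ratio of policy to reference likelihood.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # Loss is -log sigmoid(margin): small when the policy prefers
    # the chosen trace more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy's preference for the chosen trace (relative to the reference model) grows, which is what drives the reasoning-quality refinement.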

Nanxi Li, Zhengyue Zhao, G. Edward Suh, Marco Pavone, Chaowei Xiao • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MM-Vet v2 | MM-Vet v2 Score | 62.5 | 23 |
| Multimodal Jailbreak Defense | MIS-Hard | ASR | 2.68 | 12 |
| Safety Evaluation | JailbreakV-28K LLM | ASR | 1.46 | 10 |
| Safety Evaluation | MIS Easy | ASR | 3.28 | 10 |
| Helpfulness Evaluation | MM-Vet2 (test) | GPT-Eval Score | 48.9 | 10 |
| Safety Evaluation | VLBreak (Challenge) | ASR | 0.05 | 10 |
| Safety Evaluation | MIS-Hard | Attack Success Rate (ASR) | 11.29 | 10 |
| Safety Evaluation | JailbreakV-28K MLLM | ASR | 0.07 | 10 |
