
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

About

Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly their tendency to rely more on textual cues than on visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLM reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights the critical image regions contributing to generated tokens without additional computational overhead. The technique can be further extended to trace how visual information flows through the reasoning process to the final answer, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient regions with the annotated critical regions, encouraging models to focus on relevant areas when conducting reasoning. Experiments show that Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.
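
The reward described above can be illustrated with a minimal sketch. The abstract does not specify how the overlap is measured, so the sketch assumes it is the fraction of total saliency mass that falls inside the annotated box; the function names `saliency_box_reward` and `group_relative_advantages` are hypothetical and are not taken from the paper. The advantage step follows the usual GRPO formulation, in which each sampled response's reward is normalized by the mean and standard deviation of its group.

```python
from statistics import mean, pstdev


def saliency_box_reward(saliency, box):
    """Fraction of saliency mass inside a bounding box (assumed overlap measure).

    saliency: 2D list of non-negative scores, indexed as saliency[y][x].
    box: (x0, y0, x1, y1) in pixel indices, half-open on the right and bottom.
    """
    x0, y0, x1, y1 = box
    total = sum(sum(row) for row in saliency)
    if total <= 0:
        return 0.0  # no saliency at all: give no reward
    inside = sum(sum(row[x0:x1]) for row in saliency[y0:y1])
    return inside / total


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two hypothetical 4x4 saliency maps for responses to the same question.
focused = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
uniform = [[1] * 4 for _ in range(4)]
box = (1, 1, 3, 3)  # annotated region covering the central 2x2 block

rewards = [saliency_box_reward(m, box) for m in (focused, uniform)]  # [1.0, 0.25]
advantages = group_relative_advantages(rewards)  # roughly [+1, -1]
```

Under this assumption, a response whose saliency concentrates inside the annotated box receives a positive advantage relative to a response that spreads attention uniformly over the image, which is the behavior the reward is meant to encourage.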

Shizhan Gong, Minda Hu, Qiyuan Zhang, Chen Ma, Qi Dou • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy: 88.1 | 1455 |
| Chart Question Answering | ChartQA (test) | Accuracy: 88.2 | 176 |
| Object Hallucination Evaluation | POPE (test) | Accuracy: 88.1 | 79 |
| Science Question Answering | ScienceQA IMG (test) | Accuracy: 94.3 | 74 |
| Multimodal Perception and Cognition | MME (test) | Overall Score: 2.39e+3 | 39 |
| Multimodal Question Answering | MMBench EN (test) | Accuracy: 81.8 | 26 |
| Multimodal Reasoning and Perception | MMStar (test) | Accuracy: 62.6 | 19 |
| Real-world Multimodal Evaluation | MME-RW (test) | Overall Score: 62.9 | 15 |
| Multimodal Reasoning | MMMU-Pro (test) | Accuracy: 37.6 | 14 |
| Visual Illusion Understanding | IllusionVQA loc (test) | Accuracy: 39.9 | 11 |
Showing 10 of 11 rows
