Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

About

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a two-part faithfulness challenge: faithful perception of task-relevant visual evidence, and faithful use of that evidence during reasoning. Failures on either part lead to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, which exposes a perception-reasoning disconnect in which correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising the attention of a dedicated <Focus> token directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both the 3B and 7B Qwen2.5-VL-Instruct backbones while using substantially less training data.
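
The abstract does not give the loss or reward formulas, so the following is only an illustrative sketch of the two ideas, not the authors' implementation. It assumes the <Focus> token's attention is available as one weight per image patch, that the supervised region is a 0/1 patch mask, and that a per-step "sensitivity" score has already been obtained by comparing the model's predictions on the original image and on a counterfactual image. The function names `anchoring_loss` and `reinforcing_reward`, their arguments, and the specific formulas are all hypothetical.

```python
import math


def anchoring_loss(focus_attention, region_mask, eps=1e-8):
    """Hypothetical anchoring objective (an assumption, not from the paper).

    Returns the negative log of the share of attention mass that the
    focus token places on patches inside the annotated region.

    focus_attention: non-negative attention weights, one per image patch.
    region_mask:     0/1 flags, 1 for patches in the task-relevant region.
    """
    if len(focus_attention) != len(region_mask):
        raise ValueError("attention and mask must cover the same patches")
    total = sum(focus_attention)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    inside = sum(a for a, m in zip(focus_attention, region_mask) if m) / total
    return -math.log(inside + eps)


def reinforcing_reward(answer_correct, step_attention, step_sensitivity):
    """Hypothetical trajectory reward (an assumption, not from the paper).

    Only answer-correct trajectories are rewarded. The bonus is the visual
    attention at each reasoning step, averaged with weights proportional to
    how strongly that step reacted to the counterfactual image.

    step_attention:   per-step share of attention on image tokens, in [0, 1].
    step_sensitivity: per-step non-negative change under the counterfactual.
    """
    if len(step_attention) != len(step_sensitivity):
        raise ValueError("one attention value and one sensitivity per step")
    if not answer_correct:
        return 0.0
    total = sum(step_sensitivity)
    if total <= 0:
        # Vision never mattered for this trajectory: plain correctness reward.
        return 1.0
    bonus = sum((s / total) * a for s, a in zip(step_sensitivity, step_attention))
    return 1.0 + bonus
```

Under these assumptions, the anchoring loss approaches zero when all of the focus token's attention lands inside the region, and the reward exceeds the plain correctness reward only when attention is high at the steps where swapping the image changes the model's behavior.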

Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye • 2026

Related benchmarks

Task                        Dataset          Result                           Rank
Mathematical Reasoning      MathVista        Accuracy 73.5                    382
Mathematical Reasoning      WeMath           Accuracy 68.9                    225
Mathematical Reasoning      MathVerse        Accuracy 51.9                    183
Visual Question Answering   MMMU-Pro         Accuracy 39.7                    26
General VQA                 HallusionBench   Accuracy 69.8                    20
Math Reasoning              DynaMath         Worst Case Accuracy (WCA) 26.8   11
Math Reasoning              MathVision       Accuracy 28.3                    10