
Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains

About

Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
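The abstract describes two mechanisms: a normalized, reliability-weighted combination of the five intrinsic cues, and a critic-free GRPO objective with a confidence-aware cooling term. The sketch below illustrates how such a reward and advantage computation could look; every name in it (`aggregate_reward`, `grpo_advantages`, the `tau` threshold, the specific cooling rule) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

# Illustrative sketch only: the cue names come from the abstract, but the
# normalization, weighting, and cooling rules below are assumptions.
CUES = ["semantic_alignment", "lexical_fidelity", "non_redundancy",
        "visual_grounding", "step_consistency"]

def aggregate_reward(cue_scores: dict, reliability: dict) -> float:
    """Combine the five intrinsic cues into a single process-level reward.

    Each cue score is clipped to [0, 1] (a common-scale normalization,
    assumed) and weighted by a per-cue reliability estimate whose weights
    are renormalized to sum to 1 (reliability-weighted reward).
    """
    w = np.array([reliability[c] for c in CUES], dtype=float)
    w = w / w.sum()                       # reliability weights sum to 1
    s = np.array([cue_scores[c] for c in CUES], dtype=float)
    s = np.clip(s, 0.0, 1.0)              # keep cues on a shared scale
    return float(np.dot(w, s))

def grpo_advantages(rewards: np.ndarray, confidences: np.ndarray,
                    tau: float = 0.8) -> np.ndarray:
    """Critic-free GRPO advantage with a confidence-aware cooling factor.

    As in standard GRPO, each sampled response's advantage is its reward
    standardized within the sampled group (no learned critic). The cooling
    rule here, which shrinks advantages of responses whose confidence
    exceeds a threshold tau, is one plausible reading of the abstract's
    "confidence-aware cooling mechanism", not the authors' formula.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    cooling = np.where(confidences > tau, tau / confidences, 1.0)
    return adv * cooling

# Toy usage: a group of 4 sampled reasoning traces for one prompt.
rels = {c: r for c, r in zip(CUES, [1.0, 0.8, 0.6, 1.0, 0.9])}
group_rewards = np.array([
    aggregate_reward({c: s for c, s in zip(CUES, scores)}, rels)
    for scores in [
        [0.9, 0.8, 0.7, 0.85, 0.9],
        [0.4, 0.5, 0.9, 0.30, 0.6],
        [0.7, 0.7, 0.6, 0.70, 0.8],
        [0.2, 0.3, 0.8, 0.20, 0.4],
    ]
])
confs = np.array([0.95, 0.60, 0.70, 0.99])  # per-trace model confidence
print(grpo_advantages(group_rewards, confs))
```

One design note on the sketch: standardizing rewards within the group keeps the objective critic-free, so the cooling factor is the only extra moving part; applying it multiplicatively to the advantage is the simplest way to damp updates driven by overconfident generations.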

Jesen Zhang, Ningyuan Liu, Kaitong Cai, Sidi Liu, Jing Yang, Ziliang Chen, Xiaofei Sun, Keze Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| General Multimodal Understanding | General Multimodal Evaluation Suite (MMMU, MMBench, MME, ChartQA, AI2D, HallBench) | MMMU (Val): 67.6 | 14 |
| Visual Perception and Reasoning | V* Bench 1.0 (test) | Attribute Score: 83.48 | 13 |
