
iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

About

Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer-Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings.
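
To make the idea of a trajectory-aware reward concrete, the following is a minimal toy sketch, not the authors' implementation. The abstract does not give the reward formulas, so everything here is an assumption made for illustration: the outcome reward is taken to be majority-vote agreement among sampled answers, the trajectory signal is taken to be token-overlap agreement between position-aligned reasoning steps of samples that reach the same answer, and the names `jaccard`, `step_agreement`, `intrinsic_rewards`, and the `weight` parameter are hypothetical.

```python
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two steps (an assumed stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def step_agreement(steps_a, steps_b) -> float:
    """Mean similarity over position-aligned steps; unmatched steps count as 0."""
    n = max(len(steps_a), len(steps_b))
    if n == 0:
        return 1.0
    return sum(jaccard(x, y) for x, y in zip(steps_a, steps_b)) / n


def intrinsic_rewards(samples, weight=0.5):
    """samples: list of (steps, answer) pairs sampled for one question.

    Returns one label-free reward per sample:
    outcome agreement + weight * trajectory agreement (both assumed forms).
    """
    counts = Counter(answer for _, answer in samples)
    rewards = []
    for i, (steps, answer) in enumerate(samples):
        # Outcome-level term: share of samples that reach the same answer.
        outcome = counts[answer] / len(samples)
        # Trajectory-level term: agreement with other paths to the same answer.
        peers = [s for j, (s, a) in enumerate(samples) if j != i and a == answer]
        if peers:
            trajectory = sum(step_agreement(steps, p) for p in peers) / len(peers)
        else:
            trajectory = 0.0
        rewards.append(outcome + weight * trajectory)
    return rewards


samples = [
    (["count the bars", "sum is 7"], "7"),
    (["count the bars", "sum is 7"], "7"),
    (["guess"], "7"),
    (["count the bars", "sum is 9"], "9"),
]
rewards = intrinsic_rewards(samples)
```

In this toy setup the first three samples share the answer "7" and therefore the same outcome term, but the sample whose only step is "guess" receives a lower total reward than the two whose steps agree with each other. That is the kind of distinction between reasoning paths with the same answer that the abstract describes, under the assumptions stated above.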

Meghana Sunil, Manikandarajan Venmathimaran, Muthu Subash Kavitha • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Mathematical Reasoning | MathVista | Accuracy 69.74 | 278 |
| Visual Mathematical Reasoning | MathVision | Accuracy 25.29 | 186 |
| Multimodal Understanding | MMMU (val) | -- | 152 |
| Visual Mathematical Reasoning | MathVerse | Accuracy 45.91 | 135 |
| General Visual Understanding | InfoGraphic-VQA (val) | Accuracy 81.56 | 6 |
| General Visual Understanding | AI2D | Accuracy 83.89 | 6 |
| General Visual Understanding | ScienceQA | Accuracy 89.92 | 6 |
| Visual Math | ChartQA | Accuracy 85.78 | 6 |
