
iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

About

Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer-Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings.
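The abstract does not state the reward formula, so the following is only a toy sketch of the general idea: an outcome-level reward (agreement with the majority answer among sampled solutions) plus a trajectory-level term that separates reasoning paths reaching the same answer. Everything here is an assumption made for illustration and is not taken from the paper: the names `intrinsic_rewards`, `step_agreement`, and `jaccard`, the token-overlap similarity, the position-wise step alignment, and the weight `lam`.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# Hypothetical representation: (reasoning steps, final answer).
Trajectory = Tuple[Sequence[str], str]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two reasoning steps."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def step_agreement(steps_a: Sequence[str], steps_b: Sequence[str]) -> float:
    """Mean overlap of position-aligned steps; unmatched steps count as 0."""
    n = max(len(steps_a), len(steps_b))
    if n == 0:
        return 1.0
    return sum(jaccard(x, y) for x, y in zip(steps_a, steps_b)) / n


def intrinsic_rewards(trajs: List[Trajectory], lam: float = 0.5) -> List[float]:
    """Outcome reward (majority vote) plus lam * step agreement with majority peers.

    This is an illustrative stand-in, not the reward defined in the paper.
    """
    majority, _ = Counter(ans for _, ans in trajs).most_common(1)[0]
    rewards = []
    for i, (steps_i, ans_i) in enumerate(trajs):
        outcome = 1.0 if ans_i == majority else 0.0
        # Compare against the other trajectories that reached the majority answer.
        peers = [s for j, (s, a) in enumerate(trajs) if j != i and a == majority]
        if peers:
            traj = sum(step_agreement(steps_i, p) for p in peers) / len(peers)
        else:
            traj = 0.0
        rewards.append(outcome + lam * traj)
    return rewards


if __name__ == "__main__":
    samples = [
        (["count red blocks", "add two blocks"], "5"),
        (["count red blocks", "add two blocks"], "5"),
        (["guess the total"], "5"),
        (["count blue blocks", "subtract one"], "4"),
    ]
    print(intrinsic_rewards(samples))
```

In this toy setup the first three samples all reach the majority answer, yet the third receives a lower reward because its reasoning disagrees with its peers. That is the kind of distinction an outcome-only reward cannot make.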

Meghana Sunil, Manikandarajan Venmathimaran, Muthu Subash Kavitha • 2026

Related benchmarks

Task                            Dataset                  Result            Rank
Visual Mathematical Reasoning   MathVista                Accuracy 69.74    189
Multimodal Understanding        MMMU (val)               --                111
Visual Mathematical Reasoning   MathVerse                Accuracy 45.91    73
Visual Mathematical Reasoning   MathVision               Accuracy 25.29    63
General Visual Understanding    InfoGraphic-VQA (val)    Accuracy 81.56    6
General Visual Understanding    AI2D                     Accuracy 83.89    6
General Visual Understanding    ScienceQA                Accuracy 89.92    6
Visual Math                     ChartQA                  Accuracy 85.78    6
