iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models
About
Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer-Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings.
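The abstract does not give the reward formula, so the following is only an illustrative sketch of how an outcome-level intrinsic reward might be combined with a trajectory-level agreement signal. Everything in it is an assumption made for illustration: the majority-vote outcome reward, the token-overlap (Jaccard) step similarity, the position-wise step alignment, the weight `lam`, and the function names `jaccard`, `step_agreement`, and `combined_rewards`. It is not the iReasoner implementation.

```python
from collections import Counter
from itertools import zip_longest


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two reasoning steps (placeholder metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def step_agreement(steps_a, steps_b) -> float:
    """Mean similarity of position-aligned steps; an unmatched step scores 0."""
    pairs = list(zip_longest(steps_a, steps_b, fillvalue=None))
    if not pairs:
        return 0.0
    total = sum(
        jaccard(x, y) if x is not None and y is not None else 0.0
        for x, y in pairs
    )
    return total / len(pairs)


def combined_rewards(trajectories, lam=0.5):
    """Score sampled Solver trajectories without labels (hypothetical scheme).

    trajectories: list of (steps, answer) pairs for one question.
    Reward = outcome term (answer matches the majority answer)
             + lam * mean step agreement with the other majority trajectories.
    """
    answers = [ans for _, ans in trajectories]
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = []
    for i, (steps_i, ans_i) in enumerate(trajectories):
        outcome = 1.0 if ans_i == majority else 0.0
        peers = [
            s for j, (s, a) in enumerate(trajectories)
            if j != i and a == majority
        ]
        agreement = (
            sum(step_agreement(steps_i, s) for s in peers) / len(peers)
            if peers else 0.0
        )
        rewards.append(outcome + lam * agreement)
    return rewards


# Toy example: three trajectories reach "7", one reaches "9".
samples = [
    (["count the bars", "sum is 7"], "7"),
    (["count the bars", "sum is 7"], "7"),
    (["guess"], "7"),
    (["count the bars", "sum is 9"], "9"),
]
rewards = combined_rewards(samples)
```

In this toy example the third trajectory reaches the majority answer but shares no reasoning steps with its peers, so it scores lower than the first two. That is the kind of distinction between reasoning paths with the same answer that an outcome-only reward cannot make.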
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Mathematical Reasoning | MathVista | Accuracy | 69.74 | 189 |
| Multimodal Understanding | MMMU (val) | -- | -- | 111 |
| Visual Mathematical Reasoning | MathVerse | Accuracy | 45.91 | 73 |
| Visual Mathematical Reasoning | MathVision | Accuracy | 25.29 | 63 |
| General Visual Understanding | InfoGraphic-VQA (val) | Accuracy | 81.56 | 6 |
| General Visual Understanding | AI2D | Accuracy | 83.89 | 6 |
| General Visual Understanding | ScienceQA | Accuracy | 89.92 | 6 |
| Visual Math | ChartQA | Accuracy | 85.78 | 6 |