EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

About

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to ~3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline, easing future research on self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
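
The abstract describes a Solver that is rewarded through internal consistency on a continuous scale rather than by ground-truth labels. A minimal sketch of one way such a signal could be computed is shown below, assuming the reward is simply the fraction of sampled answers that agree with the most common answer. The function name `consistency_reward` and the agreement-fraction formula are illustrative assumptions, not the authors' implementation; see the linked repository for the actual reward.

```python
from collections import Counter


def consistency_reward(answers):
    """Return a continuous reward in [0, 1] from a list of sampled answers.

    Assumption (not from the paper): the reward is the share of samples
    that match the most frequent answer after light normalization.
    """
    if not answers:
        return 0.0
    # Normalize so trivially different strings count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


# Hypothetical answers sampled from the Solver for one proposed question.
samples = ["42", "42", " 42 ", "41", "40"]
reward = consistency_reward(samples)
```

Unlike a binary majority-vote signal, a graded score of this kind still gives partial credit when the samples only partly agree, which is the general idea behind a continuous reward.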

Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan • 2025

Related benchmarks

Task                                       Dataset         Result          Rank
Multi-discipline Multimodal Understanding  MMMU            Accuracy 55.31  266
Visual Mathematical Reasoning              MathVista       Accuracy 70.52  189
Multimodal Understanding                   MMMU (val)      --              111
Hallucination Evaluation                   HallusionBench  --              93
Multimodal Reasoning                       MMStar          Accuracy 61.53  81
Visual Mathematical Reasoning              MathVerse       Accuracy 44.88  73
Visual Mathematical Reasoning              MathVision      Accuracy 24.81  63
Multi-discipline Multimodal Understanding  MMMU-Pro        Accuracy 51.6   56
Visual Mathematical Reasoning              WeMath          Accuracy 64.89  53
Visual Logical Reasoning                   LogicVista      Accuracy 46.65  28
Showing 10 of 17 rows
