LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

About

Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.
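To make the term "rule-based RL" concrete, here is a minimal sketch of what a rule-based reward function might look like. It is not the reward used in LMM-R1. The `<answer>` tag convention, the `normalize` helper, and the `format_weight` parameter are all assumptions introduced for illustration; the abstract does not specify any of them.

```python
import re
from typing import Optional

# Hypothetical convention: the model wraps its final answer in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response: str) -> Optional[str]:
    """Return the content of the last <answer>...</answer> span, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].strip() if matches else None


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing period."""
    return " ".join(text.lower().split()).rstrip(".")


def rule_based_reward(
    response: str, ground_truth: str, format_weight: float = 0.1
) -> float:
    """Score a response with fixed rules instead of a learned reward model.

    The reward combines a format check (was an answer tag produced?) with
    an exact-match accuracy check. The weighting is an illustrative choice.
    """
    answer = extract_answer(response)
    if answer is None:
        return 0.0  # no parseable answer: neither format nor accuracy reward
    accuracy = 1.0 if normalize(answer) == normalize(ground_truth) else 0.0
    return format_weight + (1.0 - format_weight) * accuracy
```

Under the two-stage recipe described above, a reward of this kind would score text-only prompts in the FRE stage and image-plus-text prompts in the MGT stage. An exact-match rule only works when answers are unambiguous, which is one reason the abstract names ambiguous answers as a data limitation.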

Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, Xu Yang • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Multimodal Reasoning	MathVerse	Accuracy 41.8	259
Visual Mathematical Reasoning	MathVision	Accuracy 25.2	254
Multimodal Math Reasoning	MathVision	Accuracy 26.9	246
Multimodal Reasoning	MMStar	Accuracy 58	143
Visual Mathematical Reasoning	MathVista (testmini)	Accuracy 63.2	88
Vision-centric Reasoning	RealworldQA	Accuracy 64	66
Visual Perception and Reasoning	BLINK	Accuracy 51.1	64
Multi-modal Reasoning	MMVet (test)	Accuracy 65.9	49
Multimodal Mathematical Reasoning	MathVision (test)	Accuracy 25.2	47
General Visual Reasoning	MMStar	Accuracy 55	46

Showing 10 of 30 rows
