LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
About
Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL; the MGT stage then generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.
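The abstract does not spell out what the rule-based reward looks like. As a rough illustration only, the sketch below assumes that each training example has a single reference answer, that the model wraps its final answer in `\boxed{...}`, and that the reward is a simple exact match after light normalization. The function names (`extract_boxed`, `rule_based_reward`) and the `STAGES` list are hypothetical and are not taken from the LMM-R1 code; the same kind of verifiable reward could in principle be reused across the FRE (text-only) and MGT (multimodal) stages, with only the training data changing.

```python
import re
from typing import Optional

# Hypothetical description of the two-stage schedule named in the abstract.
# The field names are illustrative assumptions, not the authors' config.
STAGES = [
    {"name": "FRE", "data": "text-only"},
    {"name": "MGT", "data": "multimodal"},
]


def extract_boxed(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any.

    Assumption: answers contain no nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def rule_based_reward(response: str, reference: str) -> float:
    """Return 1.0 when the boxed answer matches the reference, else 0.0.

    Matching is exact after removing spaces and lowercasing; a missing
    boxed answer scores 0.0.
    """
    answer = extract_boxed(response)
    if answer is None:
        return 0.0

    def normalize(text: str) -> str:
        return text.replace(" ", "").lower()

    return 1.0 if normalize(answer) == normalize(reference) else 0.0
```

A reward of this kind needs no learned reward model, which is why unambiguous, checkable answers matter: the first barrier listed in the abstract (ambiguous answers) is exactly the case where such a rule cannot be applied reliably.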
Related benchmarks
| Task | Dataset | Result (Accuracy) | Rank |
|---|---|---|---|
| Mathematical Multimodal Reasoning | MathVerse | 41.8 | 259 |
| Visual Mathematical Reasoning | MathVision | 25.2 | 254 |
| Multimodal Math Reasoning | MathVision | 26.9 | 246 |
| Multimodal Reasoning | MMStar | 58 | 143 |
| Visual Mathematical Reasoning | MathVista (testmini) | 63.2 | 88 |
| Vision-centric Reasoning | RealworldQA | 64 | 66 |
| Visual Perception and Reasoning | BLINK | 51.1 | 64 |
| Multi-modal Reasoning | MMVet (test) | 65.9 | 49 |
| Multimodal Mathematical Reasoning | MathVision (test) | 25.2 | 47 |
| General Visual Reasoning | MMStar | 55 | 46 |