MiMo-VL Technical Report
About
We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH500 (test) | -- | 895 |
| Instruction Following | IFEval | -- | 836 |
| Object Detection | LVIS v1.0 (val) | -- | 542 |
| Visual Question Answering | ChartQA | Accuracy 71.52 | 519 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score 66.67 | 517 |
| Multimodal Understanding | SEED-Bench | -- | 516 |
| Mathematical Reasoning | MathVista | Score 81.5 | 474 |
| GUI Grounding | ScreenSpot Pro | Average Score 41.2 | 458 |
| Multimodal Understanding | MMStar | Accuracy 72.87 | 407 |
| Diagram Question Answering | AI2D | AI2D Accuracy 83.5 | 387 |