
Xiaomi MiMo-VL-Miloco Technical Report

About

We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.
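The reinforcement-learning stage above is based on Group Relative Policy Optimization (GRPO), whose core idea is to normalize each sampled response's reward against the statistics of its own sampling group rather than a learned value baseline. A minimal sketch of that group-relative advantage computation is below; the function name and reward values are illustrative, not taken from the report:

```python
# Sketch of the group-relative advantage used in GRPO: for each
# prompt, a group of responses is sampled, and each response's
# reward is normalized by the group's mean and standard deviation.

def group_relative_advantages(rewards, eps=1e-6):
    """Map a group of scalar rewards to zero-mean, unit-std advantages."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards for four sampled responses to the same prompt
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the baseline is the group mean, above-average responses receive positive advantages and below-average ones negative, with no separate critic network required.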

Jiaze Li, Jingyang Chen, Yuxun Qu, Shijie Xu, Zhenru Lin, Junyou Zhu, Boshen Xu, Wenhui Tan, Pei Fu, Jianzhong Ju, Zhenbo Luo, Jian Luan• 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH500 (test) | -- | 381 |
| GUI Grounding | ScreenSpot v2 | Avg Accuracy 92.1 | 203 |
| Chart Question Answering | ChartQA (test) | Accuracy 92 | 129 |
| Multimodal Understanding | MMMU (val) | -- | 111 |
| GUI Grounding | ScreenSpot Pro | -- | 77 |
| Document Question Answering | DocVQA (test) | Accuracy 95.2 | 59 |
| Mathematical Reasoning | MathVision (test) | Accuracy 54 | 41 |
| Video Understanding | Video-MME (test) | -- | 40 |
| Optical Character Recognition | OCRBench (test) | -- | 34 |
| Multimodal Understanding | MMMU-Pro | Vis Accuracy 55.7 | 20 |

Showing 10 of 33 rows.
