MiMo-Embodied: X-Embodied Foundation Model Technical Report

About

We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate Autonomous Driving and Embodied AI and achieve state-of-the-art performance in both. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction, and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that, through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.

Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, Xianhui Meng, Yuchen Zhang, Jing Wu, Jinghui Lu, Chenxu Dang, Jiayi Guan, Jianhua Wu, Zhiyi Hou, Hanbing Li, Shumeng Xia, Mingliang Zhou, Yinan Zheng, Zihao Yue, Shuhao Gu, Hao Tian, Yuannan Shen, Jianwei Cui, Wen Zhang, Shaoqing Xu, Bing Wang, Haiyang Sun, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Chaofan Zhang, Wenbo Ding, Kun Ma, Guang Chen, Rui Cai, Diyun Xiang, Heng Qu, Fuli Luo, Hangjun Ye, Long Chen • 2025

Related benchmarks

Task	Dataset	Result	(unlabeled)
Diagram Understanding	AI2D	Accuracy 84.2	377
3D Visual Grounding	ScanRefer	--	172
Spatial Reasoning	EmbSpatial	Overall Accuracy 76.2	131
3D Dense Captioning	Scan2Cap	--	127
Visual Reasoning	BLINK	Accuracy 0.00e+0	116
Counting	CountBench	Accuracy 87.37	102
Spatial Reasoning	MindCube	Accuracy 32.3	91
Spatial Reasoning	CV-Bench	Accuracy 88.8	89
Multimodal Understanding	MMBench (dev)	MMB Score 81.4	73
Embodied Task Completion	EB-Habitat	--	63

Showing 10 of 162 rows
