Motus: A Unified Latent Action World Model

About

While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformers (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid, thereby extracting pixel-level "delta actions" and enabling large-scale action pretraining. Experiments show that Motus outperforms state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (improvements of +11% to +48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
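
The UniDiffuser-style mode switching described above can be illustrated with a small sketch. It assumes, as in UniDiffuser, that each modality (future video and action) carries its own diffusion timestep: a timestep of 0 means the modality is given as a clean condition, the maximum timestep means it is pure noise and effectively marginalized out, and a timestep that decreases over the sampling steps means the modality is generated. The names T_MAX, ModeSchedule, MODES, timesteps, and sample_training_timesteps, the mapping of the five modes onto these policies, the linear schedule, and the assumption that the current observation and language instruction are always clean conditions are hypothetical illustrations, not Motus's released code.

```python
import random
from dataclasses import dataclass
from typing import Tuple

# Hypothetical number of diffusion timesteps (not taken from the Motus paper).
T_MAX = 1000


@dataclass(frozen=True)
class ModeSchedule:
    """Per-modality timestep policy for one modeling mode.

    "clean"   -> timestep 0: the modality is an observed condition.
    "noise"   -> timestep T_MAX: the modality is pure noise (marginalized out).
    "denoise" -> timestep falls from T_MAX toward 0: the modality is generated.
    """
    video: str
    action: str


# Hypothetical mapping of the five modes; the current frame and language
# instruction are assumed to always be clean conditions and are not shown.
MODES = {
    "world_model": ModeSchedule(video="denoise", action="clean"),       # actions -> future frames
    "vla": ModeSchedule(video="noise", action="denoise"),               # observation -> actions
    "inverse_dynamics": ModeSchedule(video="clean", action="denoise"),  # frames -> actions
    "video_generation": ModeSchedule(video="denoise", action="noise"),  # observation -> frames
    "joint": ModeSchedule(video="denoise", action="denoise"),           # frames and actions together
}


def timesteps(mode: str, k: int, num_steps: int) -> Tuple[int, int]:
    """Return (video_t, action_t) at sampling step k (0-based) of num_steps."""
    sched = MODES[mode]
    # Linear integer schedule: T_MAX at k=0, down to T_MAX // num_steps at the last step.
    t_gen = T_MAX * (num_steps - k) // num_steps

    def resolve(policy: str) -> int:
        if policy == "clean":
            return 0
        if policy == "noise":
            return T_MAX
        return t_gen

    return resolve(sched.video), resolve(sched.action)


def sample_training_timesteps(rng: random.Random) -> Tuple[int, int]:
    """UniDiffuser-style training: draw an independent timestep per modality,
    so one network learns every clean / noisy / pure-noise combination."""
    return rng.randint(0, T_MAX), rng.randint(0, T_MAX)


if __name__ == "__main__":
    for mode in MODES:
        print(mode, [timesteps(mode, k, 4) for k in range(4)])
```

Because training samples the two timesteps independently, switching modes at inference only changes which timesteps are fed to the same network: in this sketch, timesteps("world_model", 0, 10) returns (1000, 0), while timesteps("inverse_dynamics", 0, 10) returns (0, 1000).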

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, Jun Zhu • 2025

Related benchmarks

Task	Dataset	Result
Robot Manipulation	LIBERO	Object Achievement: 99.8	1025
Robotic Manipulation	LIBERO	Spatial Success Rate: 96.8	570
Robot Manipulation	LIBERO	Spatial Success Rate: 96.8	223
Robotic Manipulation	LIBERO	Long-horizon Success Rate: 97.6	165
Robotic Manipulation	LIBERO v1 (test)	Average Success Rate: 97.7	118
Robotic Manipulation	RoboTwin 2.0	Average Success Rate: 88	115
Robotic Manipulation	LIBERO	Long Success Rate: 97.6	108
Robot Manipulation	RoboTwin Randomized 2.0	Overall Success Rate: 87.02	100
Robot Manipulation	LIBERO	Spatial Success: 96.8	90
Robotic Manipulation	LIBERO (test)	Object Success Rate: 99.8	85

Showing 10 of 90 rows

...
