
Motus: A Unified Latent Action World Model

About

While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (improvements of +11% to +48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
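To make the mode switching concrete, here is a minimal sketch of how a UniDiffuser-style scheduler can select a modeling mode by giving each modality its own diffusion timestep. It is not taken from the Motus implementation. The names `Mode`, `modality_timesteps`, and `T_MAX`, as well as the exact mapping from modes to timestep pairs, are assumptions based on the general UniDiffuser idea: a timestep of 0 means the modality is clean and acts as a condition, the maximum timestep means it is pure noise and is effectively marginalized out, and a shared timestep means both modalities are denoised jointly.

```python
from enum import Enum

T_MAX = 1000  # hypothetical number of diffusion steps (assumption)


class Mode(Enum):
    WORLD_MODEL = "world_model"            # predict video given actions
    VLA = "vla"                            # predict actions, video marginalized
    INVERSE_DYNAMICS = "inverse_dynamics"  # predict actions given video
    VIDEO_GENERATION = "video_generation"  # predict video, actions marginalized
    JOINT = "joint"                        # predict video and actions together


def modality_timesteps(mode, t, t_max=T_MAX):
    """Return (t_video, t_action) for the current denoising step t.

    Illustrative mapping only: 0 = clean (used as condition),
    t_max = pure noise (marginalized), t = currently being denoised.
    """
    if not 0 <= t <= t_max:
        raise ValueError("t must lie in [0, t_max]")
    if mode is Mode.WORLD_MODEL:
        return (t, 0)
    if mode is Mode.VLA:
        return (t_max, t)
    if mode is Mode.INVERSE_DYNAMICS:
        return (0, t)
    if mode is Mode.VIDEO_GENERATION:
        return (t, t_max)
    if mode is Mode.JOINT:
        return (t, t)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in Mode:
        print(mode.value, modality_timesteps(mode, 500))
```

Under this assumed scheme, a single network trained with independently sampled video and action timesteps could serve all five modes at inference time simply by changing which timestep pair it is fed.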

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, Jun Zhu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Robot Manipulation | LIBERO | -- | 957 |
| Robotic Manipulation | LIBERO | Spatial Success Rate: 96.8 | 527 |
| Robotic Manipulation | RoboTwin 2.0 | Average Success Rate: 88 | 100 |
| Robot Manipulation | RoboTwin Clean 2.0 | Average Success Rate: 89 | 39 |
| Robotic Manipulation | RoboTwin (random-scene) | Success Rate: 87.02 | 36 |
| Robot Manipulation | RoboTwin Randomized 2.0 | Overall Success Rate: 87 | 33 |
| Robot Manipulation | LIBERO (All four suites (combined)) | Spatial Success Rate: 96.8 | 27 |
| Robotic Manipulation | RoboTwin Easy 2.0 | -- | 19 |
| Robotic Manipulation | WISER (test) | Grasp Success: 34 | 18 |
| Robotic Manipulation | WISER (train) | Grasp Success Rate: 78 | 18 |
Showing 10 of 48 rows
