Motus: A Unified Latent Action World Model

About

While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents the unification of multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further uses optical flow to learn latent actions and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus outperforms state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (improvements of +11% to +48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
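The mode switching described in the abstract can be illustrated with a small sketch. It follows the general UniDiffuser idea of giving each modality its own diffusion timestep: a modality at timestep 0 is treated as clean conditioning, a modality at the maximum timestep is pure noise and is effectively ignored, and a modality at an intermediate timestep is the one being denoised. The mapping below from the five modes to timestep pairs is an assumption made for illustration, not the paper's confirmed implementation; `T_MAX`, `ModeTimesteps`, and `timesteps_for_mode` are hypothetical names.

```python
from dataclasses import dataclass

T_MAX = 1000  # hypothetical number of diffusion steps (assumption)


@dataclass(frozen=True)
class ModeTimesteps:
    """Per-modality diffusion timesteps for one denoising call."""
    video_t: int   # 0 = clean video (conditioning), T_MAX = pure noise (ignored)
    action_t: int  # 0 = clean actions (conditioning), T_MAX = pure noise (ignored)


def timesteps_for_mode(mode: str, t: int, t_max: int = T_MAX) -> ModeTimesteps:
    """Map a modeling mode and the current timestep t to a timestep pair.

    Assumed mapping, in the spirit of a UniDiffuser-style scheduler:
    the modality being generated gets t, a conditioning modality gets 0,
    and a marginalised modality gets t_max.
    """
    if not 0 < t <= t_max:
        raise ValueError(f"t must be in (0, {t_max}], got {t}")
    table = {
        # Predict future video given known actions.
        "world_model": ModeTimesteps(video_t=t, action_t=0),
        # Predict actions given known video.
        "inverse_dynamics": ModeTimesteps(video_t=0, action_t=t),
        # Predict actions without relying on future video.
        "vla": ModeTimesteps(video_t=t_max, action_t=t),
        # Predict video without relying on actions.
        "video_generation": ModeTimesteps(video_t=t, action_t=t_max),
        # Denoise video and actions together.
        "joint": ModeTimesteps(video_t=t, action_t=t),
    }
    if mode not in table:
        raise ValueError(f"unknown mode: {mode!r}")
    return table[mode]
```

Under these assumptions, a single network that accepts `(video_t, action_t)` can serve all five roles, and only the timestep pair passed at sampling time changes between modes.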

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, Jun Zhu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | RoboTwin Clean 2.0 | Place Dual Shoes Success | 94 | 20 |
| Robot Manipulation | RoboTwin Randomized 2.0 | Success Rate: Place Dual Shoes | 94 | 20 |
| Language-conditioned manipulation | LIBERO Long | Avg Success Score | 97.6 | 6 |
| Bimanual Robotic Manipulation | RoboTwin Easy 2.0 | Success Rate (H=1) | 91 | 5 |
| Bimanual Robotic Manipulation | RoboTwin Hard 2.0 | Success Rate (H=1) | 90.6 | 5 |
| Brew Coffee using Coffee Maker | AC-One | Partial Success Rate | 62 | 3 |
| Fold Towel | AC-One | Partial Success Rate | 14.5 | 3 |
| Fold Towel | Agilex-Aloha-2 | Partial Success Rate | 39 | 3 |
| Get Water from Water Dispenser | AC-One | Partial Success Rate | 36 | 3 |
| Get Water from Water Dispenser | Agilex-Aloha-2 | Partial Success Rate | 96 | 3 |
Showing 10 of 21 rows
