Motus: A Unified Latent Action World Model
About
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents the unification of multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between modeling modes (world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further uses optical flow to learn latent actions, extracting pixel-level "delta actions", and adopts a training recipe with a three-phase pipeline and a six-layer data pyramid, enabling large-scale action pretraining. Experiments show that Motus outperforms state-of-the-art methods both in simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and in real-world scenarios (improvements of +11% to +48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
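To make the mode switching concrete, the following is a minimal sketch of how a UniDiffuser-style scheduler can select a modeling mode by giving each modality its own diffusion timestep. It is not the Motus implementation: the names `Mode`, `T_MAX`, and `modality_timesteps`, the step count, and the exact mapping of modes to timestep pairs are assumptions made for illustration. The convention assumed here is the usual UniDiffuser one: timestep 0 means the modality is clean and acts as a condition, the maximum timestep means the modality is pure noise and is effectively ignored, and a shared timestep means both modalities are denoised jointly.

```python
from enum import Enum

T_MAX = 1000  # hypothetical number of diffusion steps (assumption)


class Mode(Enum):
    WORLD_MODEL = "world_model"        # predict video given actions
    VLA = "vla"                        # predict actions, video ignored
    IDM = "inverse_dynamics"           # predict actions given video
    VIDEO_GEN = "video_generation"     # predict video, actions ignored
    JOINT = "joint"                    # predict video and actions together


def modality_timesteps(mode: Mode, t: int) -> tuple[int, int]:
    """Return (t_video, t_action) for the current sampling step t.

    0      -> modality is clean and serves as conditioning
    T_MAX  -> modality is pure noise, i.e. marginalized out
    t      -> modality is being denoised at this step
    """
    if not 0 <= t <= T_MAX:
        raise ValueError(f"t must be in [0, {T_MAX}], got {t}")
    if mode is Mode.WORLD_MODEL:
        return t, 0
    if mode is Mode.VLA:
        return T_MAX, t
    if mode is Mode.IDM:
        return 0, t
    if mode is Mode.VIDEO_GEN:
        return t, T_MAX
    if mode is Mode.JOINT:
        return t, t
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in Mode:
        print(mode.value, modality_timesteps(mode, 500))
```

Under this assumed scheme, a single set of network weights serves all five modes, because the only thing that changes between them is the pair of timesteps fed to the video and action experts.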
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robot Manipulation | RoboTwin Clean 2.0 | Place Dual Shoes Success | 94 | 20 |
| Robot Manipulation | RoboTwin Randomized 2.0 | Success Rate: Place Dual Shoes | 94 | 20 |
| Language-conditioned manipulation | LIBERO Long | Avg Success Score | 97.6 | 6 |
| Bimanual Robotic Manipulation | RoboTwin Easy 2.0 | Success Rate (H=1) | 91 | 5 |
| Bimanual Robotic Manipulation | RoboTwin Hard 2.0 | Success Rate (H=1) | 90.6 | 5 |
| Brew Coffee using Coffee Maker | AC-One | Partial Success Rate | 62 | 3 |
| Fold Towel | AC-One | Partial Success Rate | 14.5 | 3 |
| Fold Towel | Agilex-Aloha-2 | Partial Success Rate | 39 | 3 |
| Get Water from Water Dispenser | AC-One | Partial Success Rate | 36 | 3 |
| Get Water from Water Dispenser | Agilex-Aloha-2 | Partial Success Rate | 96 | 3 |