Motubrain: An Advanced World Action Model for Robot Control

About

Vision-Language-Action (VLA) models generalize well semantically but often lack fine-grained modeling of world dynamics. We present Motubrain, a unified World Action Model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only, task-agnostic, and cross-embodiment robot data. Building on Motus, Motubrain further introduces unified multiview modeling, an independent text stream for stronger language-action coupling, a shared cross-embodiment action representation, and an efficient post-training and deployment recipe for long-horizon real-world control. Our inference stack combines step reduction, compilation, FP8 quantization, DiT caching, V2A-style action-only inference, and real-time chunked closed-loop execution, achieving over 50x speedup over a naive baseline and up to 11 Hz inference. Experimentally, Motubrain achieves 95.8% and 96.1% average success on RoboTwin 2.0 under clean and randomized settings, respectively, attains the strongest reported EWMScore in our WorldArena comparison, and adapts to new humanoid embodiments with only 50 to 100 trajectories. These results show that unified world action models can scale in generality, predictive accuracy, and real-world deployability.
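As a rough illustration of how one model can cover the five modes listed in the abstract, the sketch below follows the general UniDiffuser idea of giving each modality its own diffusion timestep: a modality at timestep 0 acts as a clean condition, a modality pinned at the maximum timestep is effectively marginalized out, and a modality that follows the sampling schedule is generated. The mode names, the `Role` enum, `T_MAX`, and the mapping itself are assumptions made for illustration; the abstract does not specify Motubrain's actual scheduling, and "video" here means the predicted future frames, not the current observation.

```python
from enum import Enum

T_MAX = 1000  # hypothetical number of diffusion steps (assumption)


class Role(Enum):
    GIVEN = "given"          # clean input, timestep fixed at 0
    GENERATED = "generated"  # denoised along the sampling schedule
    IGNORED = "ignored"      # marginalized, timestep fixed at T_MAX


# Hypothetical mode table: (future-video role, action role).
MODES = {
    "policy": (Role.IGNORED, Role.GENERATED),
    "world_model": (Role.GENERATED, Role.GIVEN),
    "video_generation": (Role.GENERATED, Role.IGNORED),
    "inverse_dynamics": (Role.GIVEN, Role.GENERATED),
    "joint_prediction": (Role.GENERATED, Role.GENERATED),
}


def timestep(role: Role, t: int) -> int:
    """Map a modality's role to the timestep fed to the denoiser."""
    if role is Role.GIVEN:
        return 0
    if role is Role.IGNORED:
        return T_MAX
    return t  # GENERATED: follow the current sampling step


def timesteps_for(mode: str, t: int) -> tuple[int, int]:
    """Return (t_video, t_action) for a mode at sampling step t."""
    video_role, action_role = MODES[mode]
    return timestep(video_role, t), timestep(action_role, t)
```

Under these assumptions, `timesteps_for("inverse_dynamics", 400)` fixes the video timestep at 0 (frames are given) while the action is denoised at step 400. Switching modes only changes the pair of timesteps, not the network.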

Motubrain Team, Chendong Xiang, Fan Bao, Haitian Liu, Hengkai Tan, Hongzhe Bi, James Li, Jiabao Liu, Jingrui Pang, Kiro Jing, Louis Liu, Mengchen Cai, Rongxu Cui, Ruowen Zhao, Runqing Wang, Shuhe Huang, Yao Feng, Yinze Rong, Zeyuan Wang, Jun Zhu • 2026

Related benchmarks

Task	Dataset	Result
Robot Manipulation	RoboTwin Randomized 2.0	Overall Success Rate 96.1	100
Robot Manipulation	RoboTwin Clean 2.0	Success Rate 95.8	74
Bimanual Manipulation	RoboTwin Clean setting 2.0	Success Rate 95.8	36
Robotic Manipulation	RoboTwin 50-task (Seen Tasks)	Average Success Rate 95.9	27
Bimanual Manipulation	RoboTwin 2.0 (random)	Success Rate 96.1	26
Bimanual Manipulation	RoboTwin 2.0	Success Rate 96	25
Robotic Manipulation	RoboTwin Hard 2.0	Overall Success Rate 96.1	21
World Model Evaluation	World Arena Benchmark	EWM Score 64.07	15
World Modeling	WorldArena (test)	Image Quality 44.59	15

