Causal World Modeling for Robot Control
About
This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space that integrates vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture; (2) a closed-loop rollout mechanism that continually acquires environmental feedback from ground-truth observations; and (3) an asynchronous inference pipeline that parallelizes action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are publicly available to support the community.
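To make the asynchronous, closed-loop idea concrete, here is a minimal sketch in Python. It is not LingBot-VA's implementation: `predict_chunk`, `execute_chunk`, and `run_async` are hypothetical names, the "robot" is a toy integer state, and the "policy" simply steps toward a goal. The only point illustrated is the scheduling pattern: while the current action chunk is being executed, the next chunk is predicted in a background thread from the latest real observation plus the actions still pending.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_chunk(obs, pending, goal, chunk_size):
    """Hypothetical policy stand-in: plan the next chunk of actions.

    The planner "imagines" the state after the pending actions have run,
    then emits +1 steps until the goal is reached and 0 afterwards.
    """
    imagined = obs + sum(pending)
    actions = []
    for _ in range(chunk_size):
        step = 1 if imagined < goal else 0
        imagined += step
        actions.append(step)
    return actions


def execute_chunk(state, actions):
    """Hypothetical robot stand-in: apply actions, return the observed state."""
    for action in actions:
        state += action  # toy dynamics: each action shifts the state
    return state


def run_async(initial_obs, goal, num_chunks, chunk_size):
    """Overlap prediction of chunk k+1 with execution of chunk k."""
    obs = initial_obs
    executed = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The first chunk has nothing to overlap with, so plan it directly.
        chunk = predict_chunk(obs, [], goal, chunk_size)
        for _ in range(num_chunks):
            # Start planning the next chunk in the background, conditioned on
            # the latest real observation and the actions about to be executed.
            future = pool.submit(predict_chunk, obs, chunk, goal, chunk_size)
            # Meanwhile, execute the current chunk and read back the real state.
            obs = execute_chunk(obs, chunk)
            executed.extend(chunk)
            # Collect the plan for the next iteration (the last one is unused).
            chunk = future.result()
    return obs, executed


if __name__ == "__main__":
    final_state, actions = run_async(initial_obs=0, goal=10, num_chunks=3, chunk_size=4)
    print(final_state, actions)
```

In this toy setting the imagined state always matches the real one, so the feedback step never corrects anything. In a real system the observation returned after execution would differ from the prediction, which is why each new plan is conditioned on the most recent real observation.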
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Object Achievement | 99.6 | 957 |
| Robotic Manipulation | LIBERO | Spatial Success Rate | 98.5 | 527 |
| Robot Manipulation | LIBERO (test) | Average Success Rate | 98.5 | 220 |
| Robotic Manipulation | RoboTwin 2.0 | Average Success Rate | 92.2 | 100 |
| Robot Manipulation | RoboTwin Clean 2.0 | Average Success Rate | 92.93 | 39 |
| Robot Manipulation | RoboTwin Randomized 2.0 | Overall Success Rate | 91.5 | 33 |
| Robot Manipulation | LIBERO (All four suites (combined)) | Spatial Success Rate | 98.5 | 27 |
| Robotic Manipulation | RoboTwin Easy 2.0 | -- | -- | 19 |
| Tabletop manipulation | LIBERO | Success Rate | 98.5 | 17 |
| Dynamic Manipulation | DOMINO 35 suites (full) | Success Rate (SR) | 24.1 | 16 |