Vidar: Embodied Video Diffusion Model for Generalist Manipulation
About
Scaling general-purpose manipulation to new robot embodiments remains challenging: each platform typically needs large, homogeneous demonstrations, and end-to-end pixel-to-action pipelines can degrade under background and viewpoint shifts. Building on prior advances in video-based robot control, we present Vidar, which consists of an embodied video diffusion model as the generalizable prior and a masked inverse dynamics model (MIDM) as the adapter. We start from a video diffusion model pre-trained at Internet scale and continue pre-training it for the embodied domain on 750K multi-view trajectories collected from three real-world robot platforms. For this embodied pre-training, we introduce a unified observation space that jointly encodes robot, camera, task, and scene contexts. The MIDM learns action-relevant pixel masks without dense labels, grounding the prior in the target embodiment's action space while suppressing distractors. With only 20 minutes of human demonstrations on an unseen robot (1% of the data typically required), Vidar outperforms state-of-the-art baselines and generalizes to unseen tasks, backgrounds, and camera layouts. Our results suggest a scalable recipe for "one prior, many embodiments": strong, inexpensive video priors combined with minimal on-robot alignment.
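To make the division of labor concrete, the following is a minimal toy sketch of the masked inverse dynamics idea: a soft pixel mask gates each predicted frame before a small decoder maps it to an action, so pixels with low mask values have almost no influence on the output. Everything here is an illustrative assumption rather than the paper's implementation: the names `soft_mask`, `predict_action`, and `actions_from_video` are hypothetical, the decoder is a plain linear map, and the mask logits are set by hand, whereas the abstract describes a learned model whose masks are trained without dense labels.

```python
import math
from typing import List

Frame = List[List[float]]  # toy grayscale frame, H x W (assumption)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def soft_mask(mask_logits: Frame) -> Frame:
    """Turn per-pixel logits into a soft mask with values in (0, 1)."""
    return [[sigmoid(v) for v in row] for row in mask_logits]


def apply_mask(frame: Frame, mask: Frame) -> Frame:
    """Gate every pixel by its mask value."""
    return [[p * m for p, m in zip(f_row, m_row)]
            for f_row, m_row in zip(frame, mask)]


def predict_action(frame: Frame, mask_logits: Frame,
                   weights: List[List[float]], bias: List[float]) -> List[float]:
    """Hypothetical inverse dynamics head: masked pixels -> action vector.

    A linear decoder stands in for the learned network (assumption).
    """
    masked = apply_mask(frame, soft_mask(mask_logits))
    flat = [p for row in masked for p in row]
    return [sum(w * p for w, p in zip(w_row, flat)) + b
            for w_row, b in zip(weights, bias)]


def actions_from_video(frames: List[Frame], mask_logits: Frame,
                       weights: List[List[float]],
                       bias: List[float]) -> List[List[float]]:
    """Decode one action per frame of a (here, given) predicted video."""
    return [predict_action(f, mask_logits, weights, bias) for f in frames]


# Toy setup: 2x2 frames, 1-D action. The top-left pixel is treated as
# action-relevant (high logit); the others are distractors (low logits).
mask_logits = [[20.0, -20.0], [-20.0, -20.0]]
weights = [[1.0, 1.0, 1.0, 1.0]]
bias = [0.0]

clean = [[0.5, 0.0], [0.0, 0.0]]
cluttered = [[0.5, 1.0], [1.0, 1.0]]  # same relevant pixel, new background

a_clean, a_cluttered = actions_from_video(
    [clean, cluttered], mask_logits, weights, bias)
```

In this toy setup, `a_clean` and `a_cluttered` agree to within a tiny tolerance even though three of the four pixels changed, which is the intended effect of suppressing distractors. In the described system the frames would come from the video diffusion prior rather than being given by hand.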
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robotic Manipulation | RoboTwin | Success Rate | 71.1 | 13 |
| Robot Manipulation | Real-world Robot Manipulation Average | Success Rate | 39 | 10 |
| Robotic Manipulation | Real-world Tasks Average | Average Success Rate | 37.7 | 9 |
| Offline Action Prediction | AgiBot light (Truncation > 15%) | Accuracy | 19.4 | 8 |
| Offline Action Prediction | AgiBot Truncation < 15% (heavy) | Accuracy | 18.6 | 8 |
| Robotic Manipulation | RoboTwin Downstream task | Rank-Blk Success Rate | 52 | 6 |
| Real-robot Manipulation | Real-robot Seen tasks | Stack Bowl Success Rate | 45 | 6 |
| Real-robot Manipulation | Real-robot Novel tasks OOD | Place Block Success Rate | 35 | 5 |
| Physical Manipulation | Real-world Pick & Place | Success Rate | 46.2 | 4 |
| Physical Manipulation | Real-world Microwave | Success Rate | 37.9 | 4 |