
Vidar: Embodied Video Diffusion Model for Generalist Manipulation

About

Scaling general-purpose manipulation to new robot embodiments remains challenging: each platform typically needs large, homogeneous demonstration datasets, and end-to-end pixel-to-action pipelines can degrade under background and viewpoint shifts. Building on previous advances in video-based robot control, we present Vidar, which consists of an embodied video diffusion model as a generalizable prior and a masked inverse dynamics model (MIDM) as the adapter. We take a video diffusion model pre-trained at Internet scale and continually pre-train it on the embodied domain using 750K multi-view trajectories collected from three real-world robot platforms. For this embodied pre-training, we introduce a unified observation space that jointly encodes robot, camera, task, and scene contexts. The MIDM module learns action-relevant pixel masks without dense labels, grounding the prior in the target embodiment's action space while suppressing distractors. With only 20 minutes of human demonstrations on an unseen robot (1% of typical data requirements), Vidar outperforms state-of-the-art baselines and generalizes to unseen tasks, backgrounds, and camera layouts. Our results suggest a scalable recipe for "one prior, many embodiments": strong, inexpensive video priors combined with minimal on-robot alignment.
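To make the adapter idea concrete, here is a minimal NumPy sketch of a masked inverse dynamics model: from a pair of consecutive frames it predicts a soft per-pixel mask, pools only the masked features, and regresses an action. Everything here is a toy stand-in under assumed shapes (32x32 frames, a 7-DoF action) with randomly initialised weights; the actual Vidar MIDM architecture, training objective, and how it consumes the diffusion prior's predicted video are not specified in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical shapes: two consecutive RGB frames and a 7-DoF action.
H, W, C = 32, 32, 3
ACT_DIM = 7

# Toy per-pixel mask head and action head (stand-ins for learned parameters;
# in training these would be optimised end-to-end, without dense mask labels).
W_mask = rng.normal(scale=0.01, size=(2 * C, 1))
W_act = rng.normal(scale=0.01, size=(2 * C, ACT_DIM))

def masked_inverse_dynamics(frame_t, frame_t1):
    """Predict a soft action-relevance mask from a frame pair, then regress
    the action from mask-weighted, globally pooled pixel features."""
    x = np.concatenate([frame_t, frame_t1], axis=-1)  # (H, W, 2C) frame pair
    mask = sigmoid(x @ W_mask)                        # (H, W, 1), soft mask in (0, 1)
    feat = (x * mask).mean(axis=(0, 1))               # (2C,) masked global pooling
    action = feat @ W_act                             # (ACT_DIM,) regressed action
    return action, mask

frame_t = rng.random((H, W, C))
frame_t1 = rng.random((H, W, C))
action, mask = masked_inverse_dynamics(frame_t, frame_t1)
print(action.shape, mask.shape)  # (7,) (32, 32, 1)
```

The soft mask is the key design choice: because it multiplies the features before pooling, pixels the mask suppresses (e.g. background clutter) contribute nothing to the action regression, which is how the adapter can stay robust to background and viewpoint shifts.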

Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, Jun Zhu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Robotic Manipulation | RoboTwin | Success Rate | 71.1 | 13 |
| Robot Manipulation | Real-world Robot Manipulation Average | Success Rate | 39 | 8 |
| Robot Manipulation | Real-world scenarios (Seen) | Success Rate | 72 | 3 |
| Robot Manipulation | Real-world scenarios (Unseen) | Success Rate | 0.44 | 3 |
| Robot Manipulation | Real-world scenarios (Dynamic) | Success Rate | 0.00 | 3 |
