
Vidar: Embodied Video Diffusion Model for Generalist Manipulation

About

Scaling general-purpose manipulation to new robot embodiments remains challenging: each platform typically needs large, homogeneous demonstration datasets, and end-to-end pixel-to-action pipelines can degrade under background and viewpoint shifts. Building on previous advances in video-based robot control, we present Vidar, which consists of an embodied video diffusion model as a generalizable prior and a masked inverse dynamics model (MIDM) as the adapter. We take a video diffusion model pre-trained at Internet scale and continually pre-train it on the embodied domain using 750K multi-view trajectories collected from three real-world robot platforms. For this embodied pre-training, we introduce a unified observation space that jointly encodes robot, camera, task, and scene contexts. The MIDM module learns action-relevant pixel masks without dense labels, grounding the prior in the target embodiment's action space while suppressing distractors. With only 20 minutes of human demonstrations on an unseen robot (1% of typical data requirements), Vidar outperforms state-of-the-art baselines and generalizes to unseen tasks, backgrounds, and camera layouts. Our results suggest a scalable recipe for "one prior, many embodiments": strong, inexpensive video priors combined with minimal on-robot alignment.
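To make the adapter idea concrete, here is a minimal NumPy sketch of a masked inverse dynamics model: from a pair of consecutive frames it predicts a soft per-pixel mask, pools only the masked features, and regresses an action. Everything here is a toy stand-in under assumed shapes (32x32 frames, a 7-DoF action) with randomly initialised weights; the actual Vidar MIDM architecture, training objective, and how it consumes the diffusion prior's predicted video are not specified in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical shapes: two consecutive RGB frames and a 7-DoF action.
H, W, C = 32, 32, 3
ACT_DIM = 7

# Toy per-pixel mask head and action head (stand-ins for learned parameters;
# in training these would be optimised end-to-end, without dense mask labels).
W_mask = rng.normal(scale=0.01, size=(2 * C, 1))
W_act = rng.normal(scale=0.01, size=(2 * C, ACT_DIM))

def masked_inverse_dynamics(frame_t, frame_t1):
    """Predict a soft action-relevance mask from a frame pair, then regress
    the action from mask-weighted, globally pooled pixel features."""
    x = np.concatenate([frame_t, frame_t1], axis=-1)  # (H, W, 2C) frame pair
    mask = sigmoid(x @ W_mask)                        # (H, W, 1), soft mask in (0, 1)
    feat = (x * mask).mean(axis=(0, 1))               # (2C,) masked global pooling
    action = feat @ W_act                             # (ACT_DIM,) regressed action
    return action, mask

frame_t = rng.random((H, W, C))
frame_t1 = rng.random((H, W, C))
action, mask = masked_inverse_dynamics(frame_t, frame_t1)
print(action.shape, mask.shape)  # (7,) (32, 32, 1)
```

The soft mask is the key design choice: because it multiplies the features before pooling, pixels the mask suppresses (e.g. background clutter) contribute nothing to the action regression, which is how the adapter can stay robust to background and viewpoint shifts.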

Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, Jun Zhu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Robotic Manipulation | RoboTwin | Success Rate | 71.1 | 13 |
| Robot Manipulation | Real-world Robot Manipulation Average | Success Rate | 39 | 8 |
| Robot Manipulation | Real-world scenarios (Seen) | Success Rate | 72 | 3 |
| Robot Manipulation | Real-world scenarios (Unseen) | Success Rate | 0.44 | 3 |
| Robot Manipulation | Real-world scenarios (Dynamic) | Success Rate | 0.00 | 3 |
