
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

About

Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information and often neglect the dynamic aspects vital for embodied tasks. Video diffusion models (VDMs) have recently demonstrated the ability to predict future frames and a strong understanding of the physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns an implicit inverse dynamics model conditioned on predicted future representations inside VDMs. To predict the future more precisely, we fine-tune a pre-trained video foundation model on robot datasets along with internet human manipulation data. In experiments, VPP achieves an 18.6% relative improvement on the Calvin ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6% increase in success rates for complex real-world dexterous manipulation tasks. Project page: https://video-prediction-policy.github.io
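The abstract reports its benchmark gain as a relative improvement, which is easy to confuse with an absolute difference in scores. The following minimal sketch shows the arithmetic. The function name and the example scores are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the improvement of `new` over `baseline` as a percentage of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Hypothetical scores for illustration only (not results from the paper).
baseline_score = 3.0
new_score = 3.6

absolute_gain = new_score - baseline_score
relative_gain = relative_improvement(baseline_score, new_score)

print(f"absolute gain: {absolute_gain:.2f}")
print(f"relative gain: {relative_gain:.1f}%")
```

A relative figure depends on the baseline: the same absolute gain gives a larger relative improvement when the baseline score is lower.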

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Long-horizon robot manipulation | Calvin ABCD→D | Task 1 Completion Rate | 95.7 | 96 |
| Long-horizon task completion | Calvin ABC->D | Success Rate (1) | 96.5 | 67 |
| Robot Manipulation | Calvin ABC->D | Average Successful Length | 4.329 | 36 |
| Instruction-following robotic manipulation | CALVIN ABC→D (unseen environment D) | Success Rate (Length 1) | 96.5 | 29 |
| Robotic Manipulation | Calvin ABCD→D | Success Rate (1 Inst) | 95.7 | 26 |
| Robot Manipulation | MetaWorld 50 tasks | Success Rate (Easy) | 81.8 | 21 |
| Long-horizon robotic manipulation | CALVIN ABC→D (Zero-shot) | Task 1 Success Rate | 96.5 | 16 |
| Long-Horizon Multi-Task Language Control | CALVIN ABC→D (test) | Seq Success (1) | 90.9 | 13 |
| Bimanual Robot Manipulation | RoboTwin easy setting 2.0 | Handover Block | 54 | 7 |
| Robotic Manipulation | Aloha-AgileX Real-World Basic Tasks (evaluation) | Average Success Rate | 59.5 | 7 |
Showing 10 of 21 rows
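
Several rows above report CALVIN-style metrics: success rates at a given chain length and an average successful length. As a rough sketch of how the two relate, assuming the usual CALVIN convention of chains of five consecutive instructions, the average length equals the sum of the success rates for completing at least 1, 2, ..., 5 tasks in a row (with rates expressed as fractions, not percentages). The function names and rollout data below are hypothetical and are not results from the paper.

```python
from typing import List, Sequence

CHAIN_LENGTH = 5  # assumption: each evaluation chain has five instructions


def success_rates(completed: Sequence[int], chain_length: int = CHAIN_LENGTH) -> List[float]:
    """Fraction of rollouts that completed at least k tasks in a row, for k = 1..chain_length."""
    if not completed:
        raise ValueError("need at least one rollout")
    n = len(completed)
    return [sum(1 for c in completed if c >= k) / n for k in range(1, chain_length + 1)]


def average_successful_length(rates: Sequence[float]) -> float:
    """Average number of consecutively completed tasks, computed from the per-length rates."""
    return sum(rates)


# Hypothetical rollouts: number of tasks completed in a row before the first failure.
rollouts = [5, 5, 3, 0, 2]

rates = success_rates(rollouts)
avg_from_rates = average_successful_length(rates)
avg_direct = sum(rollouts) / len(rollouts)

print("success rates for lengths 1-5:", rates)
print("average length from rates:", round(avg_from_rates, 3))
print("average length computed directly:", round(avg_direct, 3))
```

Both computations agree because the expected chain length is the sum, over k, of the probability of completing at least k tasks.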
