Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

About

Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information and often neglect the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the ability to predict future frames and a strong understanding of the physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns an implicit inverse dynamics model conditioned on the predicted future representations inside VDMs. To predict the future more precisely, we fine-tune a pre-trained video foundation model on robot datasets along with internet human manipulation data. In experiments, VPP achieves an 18.6% relative improvement on the Calvin ABC-D generalization benchmark compared to the previous state of the art, and demonstrates a 31.6% increase in success rates for complex real-world dexterous manipulation tasks. Project page: https://video-prediction-policy.github.io
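
The abstract reports a relative improvement rather than an absolute one. As a minimal sketch of how such a figure is usually computed, the snippet below contrasts the two. The function name `relative_improvement` and the scores are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the paper or from the benchmark table.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Return the improvement of `new` over `baseline` as a percentage of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) * 100.0 / baseline


# Hypothetical scores, used only to show the difference between the two measures.
baseline_score = 50.0
new_score = 60.0

absolute_gain = new_score - baseline_score                       # difference in points
relative_gain = relative_improvement(new_score, baseline_score)  # percent of the baseline

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

The same absolute gain gives a larger relative figure when the baseline is low, so a relative improvement is easiest to interpret alongside the baseline it is measured against.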

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen • 2024

Related benchmarks

Task	Dataset	Result
Robot Manipulation	LIBERO	Object Achievement: 95	1025
Long-horizon robot manipulation	Calvin ABCD→D	Task 1 Completion Rate: 95.7	140
Robotic Manipulation	Calvin ABCD→D	Avg Length: 4.28	139
Long-horizon task completion	Calvin ABC->D	Success Rate (1): 96.5	72
Robotic Manipulation	Calvin ABC->D	Task-1 Score: 96.5	71
Sequential Robotic Manipulation	CALVIN	Success Rate (1 task): 96.5	63
Robot Manipulation	Calvin ABC->D	Average Successful Length: 4.33	62
Long-horizon robotic manipulation	Calvin ABC->D	Average Trajectory Length: 3.93	48
Instruction-following robotic manipulation	CALVIN ABC→D (unseen environment D)	Success Rate (Length 1): 96.5	29
Language-conditioned long-horizon robotic manipulation	Calvin ABC->D	Success Rate (1 Task): 95.3	22

Showing 10 of 74 rows
