VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation

About

Recent advances in learning video generation models from large-scale video data demonstrate significant potential for understanding complex physical dynamics. This suggests the feasibility of leveraging diverse robot trajectory data to develop a unified, dynamics-aware model that enhances robot manipulation. However, given the relatively small amount of available robot data, directly fitting the data without considering the relationship between visual observations and actions could lead to suboptimal data utilization. To this end, we propose VidMan (Video Diffusion for Robot Manipulation), a novel framework that employs a two-stage training mechanism inspired by dual-process theory from neuroscience to enhance stability and improve data utilization efficiency. Specifically, in the first stage, VidMan is pre-trained on the Open X-Embodiment dataset (OXE) to predict future visual trajectories in a video denoising diffusion manner, enabling the model to develop long-horizon awareness of the environment's dynamics. In the second stage, a flexible yet effective layer-wise self-attention adapter is introduced to transform VidMan into an efficient inverse dynamics model that predicts actions modulated by the implicit dynamics knowledge via parameter sharing. Our VidMan framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset. These results provide compelling evidence that world models can significantly enhance the precision of robot action prediction. Code and models will be made public.

Youpeng Wen, Junfan Lin, Yi Zhu, Jianhua Han, Hang Xu, Shen Zhao, Xiaodan Liang • 2024

Related benchmarks

Task | Dataset | Result | Rank
Long-horizon task completion | Calvin ABC->D | Success Rate (1): 91.5 | 67
Instruction-following robotic manipulation | CALVIN ABC→D (unseen environment D) | Success Rate (Length 1): 91.5 | 29
Long-horizon robotic manipulation | CALVIN ABC→D (Zero-shot) | Task 1 Success Rate: 91.5 | 16
Long-horizon task completion | CALVIN | Success Rate (1 Task): 91.5 | 15
Long-horizon robot manipulation | CALVIN | Task Completion Rate (1): 91.5 | 15
Action Prediction | Bridge offline OXE (test) | MSE: 0.8 | 5
Action Prediction | Taco Play offline OXE (test) | MSE: 1 | 5
Action Prediction | Cable Routing offline OXE (test) | MSE: 3.3 | 5
Action Prediction | Autolab UR5 offline OXE (test) | MSE: 3.4 | 5
Robot Manipulation | RLBench 100 | Close Jar882.8 | 4
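
For readers unfamiliar with the MSE metric in the Action Prediction rows, the sketch below shows one common way to compute mean squared error between predicted and ground-truth action sequences. It is an illustration only: the function name `action_mse`, the averaging over all timesteps and action dimensions, and the sample values are assumptions, not the evaluation code or scaling used for the numbers listed here.

```python
from typing import Sequence


def action_mse(
    predicted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
) -> float:
    """Mean squared error over every action dimension of every timestep.

    Hypothetical helper for illustration; the listed results may use a
    different normalization or scaling.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of timesteps")

    total = 0.0
    count = 0
    for pred_step, target_step in zip(predicted, target):
        if len(pred_step) != len(target_step):
            raise ValueError("each timestep must have the same action dimension")
        for p, t in zip(pred_step, target_step):
            total += (p - t) ** 2
            count += 1

    if count == 0:
        raise ValueError("no action values to compare")
    return total / count


# Made-up 2-step, 2-dimensional actions: only one entry differs (3.0 vs 5.0).
example_pred = [[0.0, 1.0], [2.0, 3.0]]
example_true = [[0.0, 1.0], [2.0, 5.0]]
example_error = action_mse(example_pred, example_true)  # (3 - 5)^2 / 4 = 1.0
```

Lower values indicate predicted actions closer to the recorded ones, so on these rows a smaller MSE is better.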