Any-point Trajectory Modeling for Policy Learning
About
Learning from demonstration is a powerful method for teaching robots new skills, and having more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging due to the lack of action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict future trajectories of arbitrary points within a video frame. Once trained, these trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Across more than 130 language-conditioned tasks evaluated in both simulation and the real world, ATM outperforms strong video pre-training baselines by 80% on average. Furthermore, we show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology. Visualizations and code are available at: https://xingyu-lin.github.io/atm
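To make the two-stage idea in the abstract concrete, the sketch below shows one possible shape for the interface: a trajectory predictor maps query points to their future 2D positions, and those predicted tracks become an extra input for a policy. This is an illustration only, not ATM's implementation. The names `predict_tracks` and `policy_features` are hypothetical, and the constant-velocity extrapolation is a stand-in for the learned, video-pre-trained trajectory model the abstract describes.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates of one tracked point
Track = List[Point]          # predicted positions of one point over future steps


def predict_tracks(prev_pts: List[Point], curr_pts: List[Point],
                   horizon: int) -> List[Track]:
    """Hypothetical stand-in for a learned trajectory model.

    Extrapolates each point with constant velocity estimated from two
    consecutive frames. A real system would use a model pre-trained on video.
    """
    tracks = []
    for (px, py), (cx, cy) in zip(prev_pts, curr_pts):
        vx, vy = cx - px, cy - py  # per-frame displacement
        tracks.append([(cx + vx * t, cy + vy * t)
                       for t in range(1, horizon + 1)])
    return tracks


def policy_features(tracks: List[Track]) -> List[float]:
    """Flatten predicted tracks into a vector a policy could condition on."""
    return [coord for track in tracks for point in track for coord in point]


# Two query points: one moving, one static.
tracks = predict_tracks(prev_pts=[(0.0, 0.0), (10.0, 5.0)],
                        curr_pts=[(1.0, 2.0), (10.0, 5.0)],
                        horizon=3)
features = policy_features(tracks)
```

Here the moving point is extrapolated along its observed displacement while the static point stays in place, and the flattened vector has one (x, y) pair per point per future step. In a setup like the one the abstract describes, only the policy stage would need action-labeled demonstrations; the predictor could be trained from video alone.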
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Object Achievement | 89.4 | 957 |
| Robot Manipulation | LIBERO (test) | Average Success Rate | 65.7 | 220 |
| Robot Manipulation | LIBERO Object | Success Rate | 68 | 127 |
| Robot Manipulation | LIBERO | Spatial Success Rate | 69 | 116 |
| Robotic Manipulation | LIBERO Long | Success Rate | 39 | 91 |
| Robotic Manipulation | LIBERO v1 (test) | Average Success Rate | 37.5 | 83 |
| Robotic Manipulation | LIBERO Goal | Success Rate | 78 | 42 |
| Robotic Manipulation | LIBERO Average across suites | Success Rate (SR) | 63 | 29 |
| Robotic Manipulation | LIBERO Spatial | Success Rate (SR) | 69 | 28 |
| Open microwave | Simulation | Success Rate | 99.4 | 18 |