Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation
About
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation: interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on a large amount of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web, including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop plan by predicting residual actions through a closed-loop policy trained with a few embodiment-specific demonstrations. We show that combining scalably learned track prediction with a residual policy that requires minimal in-domain, robot-specific data enables diverse, generalizable robot manipulation, and we present a wide array of real-world robot manipulation results across unseen tasks, objects, and scenes.
Project page: https://homangab.github.io/track2act/
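The step from predicted point tracks to a sequence of rigid transforms can be illustrated with a small sketch. This is a planar simplification and not the authors' implementation: the abstract describes rigid transforms of a physical object, and it does not say which solver is used. The sketch assumes the tracks are given as plain lists of `(x, y)` positions and fits, for each future step, the least-squares 2D rotation and translation that maps the first-frame points onto the tracked points. The function names `fit_rigid_2d` and `transforms_from_tracks` are hypothetical.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src points onto dst.

    Planar stand-in for the rigid-transform step; assumes point i in src
    corresponds to point i in dst.
    """
    n = len(src)
    # Centroids of both point sets.
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n
    dy = sum(p[1] for p in dst) / n
    # Accumulate dot and cross products of the centered correspondences.
    dot = 0.0
    cross = 0.0
    for (ax, ay), (bx, by) in zip(src, dst):
        ax, ay, bx, by = ax - sx, ay - sy, bx - dx, by - dy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    # The optimal rotation angle for the 2D alignment problem.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that carries the rotated source centroid onto the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, tx, ty


def transforms_from_tracks(tracks):
    """tracks[t][i] is the (x, y) position of point i at step t.

    Returns one (theta, tx, ty) per future step, each relative to step 0.
    """
    first = tracks[0]
    return [fit_rigid_2d(first, frame) for frame in tracks[1:]]
```

Calling `transforms_from_tracks` on a list of per-step point positions yields one transform per future step. In a pipeline like the one described above, such a sequence would then be mapped to end-effector poses and executed open loop before a residual policy corrects it; those later stages are not sketched here.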
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Planning Billiard Shots | Billiard Simulator | Accuracy | 8 | 10 |
| Motion forecasting | Panthera High motion (test) | Variance (Velocity) | 8.01 | 9 |
| Motion forecasting | Panthera Combined (test) | Var (V) | 1.89 | 9 |
| Robot Manipulation | Meta-World low-data regime | Door Open Success | 88 | 8 |
| Motion forecasting | MammalMotion All Data (High motion) | ADE | 0.136 | 7 |
| Motion forecasting | MammalMotion All Data (Combined) | ADE | 0.053 | 7 |
| Robot Manipulation Skill Adaptation | Instruction-Guided Skill Adaptation Simulation v1 (test) | Task 1 Success Rate | 44 | 5 |
| Robotic Manipulation | IsaacSkill (within-distribution) | Pouring SR | 88 | 5 |
| Poked Motion Generation | Pexels Dense | Min MSE | 138.7 | 3 |
| Robotic Manipulation | Real-world experiment (test) | Pouring Success Rate | 44 | 3 |