R3M: A Universal Visual Representation for Robot Manipulation
About
We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty that encourages sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models are available at https://tinyurl.com/robotr3m.
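The two pre-training terms described above can be sketched as follows. This is a minimal, illustrative NumPy version only: it assumes dot-product similarity and a single negative per anchor, and it omits the video-language alignment term, which in R3M is scored by a learned model. All function and variable names here are hypothetical.

```python
import numpy as np

def time_contrastive_loss(z_anchor, z_pos, z_neg):
    """InfoNCE-style time-contrastive loss: the anchor frame's embedding
    should be closer to a temporally nearby frame (z_pos) than to a
    temporally distant one (z_neg). Dot product stands in for similarity."""
    s_pos = z_anchor @ z_pos
    s_neg = z_anchor @ z_neg
    return -np.log(np.exp(s_pos) / (np.exp(s_pos) + np.exp(s_neg)))

def l1_penalty(z, weight=1e-3):
    """L1 penalty on the embedding, encouraging sparse, compact features."""
    return weight * np.abs(z).sum()

# Toy embeddings standing in for encoder outputs on three video frames.
rng = np.random.default_rng(0)
z_t, z_near, z_far = rng.normal(size=(3, 16))

loss = time_contrastive_loss(z_t, z_near, z_far) + l1_penalty(z_t)
```

In actual training these terms are summed over sampled frame triplets and minimized with respect to the encoder's parameters; at downstream policy-learning time the encoder is frozen and only its output features are used.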
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Long-horizon robot manipulation | CALVIN ABCD→D | Task 1 Completion Rate | 75.2 | 96 |
| Robotic manipulation | RLBench | Avg. Success Score | 50.3 | 56 |
| Open door | Meta-World | VOC Score | 61.35 | 35 |
| Open drawer | Meta-World | VOC Score | 87.7 | 28 |
| Reward modeling | Meta-World Open door | Prediction Accuracy | 65.02 | 28 |
| Reward modeling | Meta-World Button press | Prediction Accuracy | 74.42 | 28 |
| Button press | Meta-World | VOC Score | 90.95 | 28 |
| Reward modeling | Meta-World Open drawer | Prediction Accuracy | 65.71 | 28 |
| Robotic manipulation | Franka Kitchen | Avg. Success Rate | 71 | 24 |
| Visuomotor control | LIBERO Goal | Success Rate | 89 | 13 |