R3M: A Universal Visual Representation for Robot Manipulation
About
We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty that encourages sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models are available at https://tinyurl.com/robotr3m.
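The two pre-training terms described above can be sketched as follows. This is a minimal, illustrative NumPy version only: it assumes dot-product similarity and a single negative per anchor, and it omits the video-language alignment term, which in R3M is scored by a learned model. All function and variable names here are hypothetical.

```python
import numpy as np

def time_contrastive_loss(z_anchor, z_pos, z_neg):
    """InfoNCE-style time-contrastive loss: the anchor frame's embedding
    should be closer to a temporally nearby frame (z_pos) than to a
    temporally distant one (z_neg). Dot product stands in for similarity."""
    s_pos = z_anchor @ z_pos
    s_neg = z_anchor @ z_neg
    return -np.log(np.exp(s_pos) / (np.exp(s_pos) + np.exp(s_neg)))

def l1_penalty(z, weight=1e-3):
    """L1 penalty on the embedding, encouraging sparse, compact features."""
    return weight * np.abs(z).sum()

# Toy embeddings standing in for encoder outputs on three video frames.
rng = np.random.default_rng(0)
z_t, z_near, z_far = rng.normal(size=(3, 16))

loss = time_contrastive_loss(z_t, z_near, z_far) + l1_penalty(z_t)
```

In actual training these terms are summed over sampled frame triplets and minimized with respect to the encoder's parameters; at downstream policy-learning time the encoder is frozen and only its output features are used.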
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Long-horizon robot manipulation | CALVIN ABCD→D | Task 1 Completion Rate | 75.2 | 96 |
| Robotic manipulation | RLBench | Avg. Success Score | 50.3 | 56 |
| Open door | Meta-World | VOC Score | 61.35 | 35 |
| Open drawer | Meta-World | VOC Score | 87.7 | 28 |
| Reward modeling | Meta-World Open door | Prediction Accuracy | 65.02 | 28 |
| Reward modeling | Meta-World Button press | Prediction Accuracy | 74.42 | 28 |
| Button press | Meta-World | VOC Score | 90.95 | 28 |
| Reward modeling | Meta-World Open drawer | Prediction Accuracy | 65.71 | 28 |
| Robotic manipulation | Franka Kitchen | Avg. Success Rate | 71 | 24 |
| Visuomotor control | LIBERO Goal | Success Rate | 89 | 13 |