# Real-World Robot Learning with Masked Visual Pre-training

## About
In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (by up to 75%), supervised ImageNet pre-training (by up to 81%), and training from scratch (by up to 81%). Finally, we train a 307M-parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and clearly demonstrate the benefits of scaling visual pre-training for robot learning.
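The frozen-encoder setup described above can be sketched in a few lines. This is an illustrative NumPy toy, not the paper's implementation: a fixed random projection stands in for the pre-trained MAE encoder (whose weights are never updated), and a single linear layer stands in for the learnable control module; all sizes (`D_PIX`, `D_FEAT`, `D_ACT`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained visual encoder (e.g. an MAE ViT):
# a fixed random projection from flattened pixels to a feature vector.
# Sizes are made up for illustration.
D_PIX, D_FEAT, D_ACT = 3 * 32 * 32, 128, 7
W_enc = rng.normal(scale=0.02, size=(D_PIX, D_FEAT))  # frozen weights

def encode(images):
    """Frozen encoder: no gradient update ever touches W_enc."""
    return np.tanh(images.reshape(len(images), -1) @ W_enc)

# Learnable control module: a linear layer mapping visual features to
# low-level actions, trained with plain gradient descent on an MSE loss.
W_ctrl = np.zeros((D_FEAT, D_ACT))

def train_step(images, actions, lr=0.1):
    global W_ctrl
    feats = encode(images)               # (B, D_FEAT); encoder stays fixed
    pred = feats @ W_ctrl                # (B, D_ACT)
    grad = feats.T @ (pred - actions) / len(images)
    W_ctrl -= lr * grad                  # only the control head is updated
    return float(np.mean((pred - actions) ** 2))

# Toy usage: regress synthetic "demonstration" actions from random images.
imgs = rng.normal(size=(16, 3, 32, 32))
acts = rng.normal(size=(16, D_ACT))
losses = [train_step(imgs, acts) for _ in range(50)]
```

The design point the paper exploits is that only the small control head is trained per task, so one pre-trained encoder can be reused across tasks and embodiments.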
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO Object | Success Rate | 63.7 | 70 |
| Robotic Manipulation | Franka-Kitchen | Avg Success Rate | 76.5 | 39 |
| Robotic Manipulation | Meta-World | Average Success Rate | 70.1 | 27 |
| Visuomotor Control | LIBERO Goal | Success Rate | 63.8 | 22 |
| Image Navigation | ImageNav | Success Rate | 68.1 | 11 |
| Object Navigation | ObjectNav | Success Rate | 55 | 11 |
| Embodied AI | VC-1 MW | Success Rate | 93.6 | 9 |
| Embodied AI | VC-1 TF | Success Rate | 73.2 | 9 |
| Embodied AI | LIBERO Spatial | Success Rate | 58 | 9 |
| Embodied AI | LIBERO-90 | Success Rate | 32.1 | 9 |