Real-World Robot Learning with Masked Visual Pre-training

About

In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.

Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, Trevor Darrell• 2022

Related benchmarks

Task	Dataset	Result
Robot Manipulation	LIBERO Object	Success Rate63.7	127
Robotic Manipulation	Franka-Kitchen	Avg Success Rate76.5	39
Robotic Manipulation	Meta-World	Average Success Rate70.1	27
Visuomotor Control	LIBERO Goal	Success Rate63.8	22
Robot Manipulation	Franka-Kitchen	Sliding Door Open Success100	15
Image Navigation	ImageNav	Success Rate68.1	11
Object Navigation	ObjectNav	Success Rate55	11
Embodied AI	VC-1 MW	Success Rate93.6	9
Embodied AI	VC-1 TF	Success Rate73.2	9
Embodied AI	LIBERO Spatial	Success Rate58	9

Showing 10 of 19 rows

Other info

Follow for update

@wizwand_team Discord