
Real-World Robot Learning with Masked Visual Pre-training

About

In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.
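To make the pre-training step above concrete, here is a minimal NumPy sketch of MAE-style random masking, where a large fraction of patch tokens is hidden and only the visible subset is passed to the encoder. The function name, shapes, and 75% mask ratio are illustrative defaults, not the authors' actual implementation.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """MAE-style random masking: keep a random subset of patch tokens.

    patches: (num_patches, dim) array of patch embeddings.
    Returns (visible, keep, mask), where mask[i] == 1 marks a
    masked-out patch. Shapes and defaults are illustrative only.
    """
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep = np.sort(perm[:n_keep])      # indices of visible patches
    mask = np.ones(n, dtype=int)
    mask[keep] = 0                     # 0 = visible, 1 = masked
    return patches[keep], keep, mask

# 196 patches (a 14x14 grid for a 224x224 image with 16x16 patches)
patches = np.random.default_rng(1).normal(size=(196, 768))
visible, keep, mask = random_masking(patches)
```

With a 75% mask ratio, only 49 of the 196 patch tokens reach the encoder, which is what makes MAE pre-training cheap at scale; the decoder then reconstructs the masked pixels, and for downstream robot control only the frozen encoder is kept.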

Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, Trevor Darrell • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Robot Manipulation | LIBERO Object | Success Rate | 63.7 | 70 |
| Robotic Manipulation | Franka-Kitchen | Avg Success Rate | 76.5 | 39 |
| Robotic Manipulation | Meta-World | Average Success Rate | 70.1 | 27 |
| Visuomotor Control | LIBERO Goal | Success Rate | 63.8 | 22 |
| Image Navigation | ImageNav | Success Rate | 68.1 | 11 |
| Object Navigation | ObjectNav | Success Rate | 55 | 11 |
| Embodied AI | VC-1 MW | Success Rate | 93.6 | 9 |
| Embodied AI | VC-1 TF | Success Rate | 73.2 | 9 |
| Embodied AI | LIBERO Spatial | Success Rate | 58 | 9 |
| Embodied AI | LIBERO-90 | Success Rate | 32.1 | 9 |

Showing 10 of 18 rows.
