Theia: Distilling Diverse Vision Foundation Models for Robot Learning

About

Vision-based robot policy learning, which maps visual inputs to actions, necessitates a holistic understanding of diverse visual tasks beyond single-task needs like classification or segmentation. Inspired by this, we introduce Theia, a vision foundation model for robot learning that distills multiple off-the-shelf vision foundation models trained on varied vision tasks. Theia's rich visual representations encode diverse visual knowledge, enhancing downstream robot learning. Extensive experiments demonstrate that Theia outperforms its teacher models and prior robot learning models using less training data and smaller model sizes. Additionally, we quantify the quality of pre-trained visual representations and hypothesize that higher entropy in feature norm distributions leads to improved robot learning performance. Code, models, and demo are available at https://theia.theaiinstitute.com.
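To make the two ideas above more concrete (distilling several frozen vision teachers into one student, and scoring representations by the entropy of their feature-norm distribution), the sketch below is a minimal PyTorch illustration under stated assumptions. The module names (FeatureTranslator, MultiTeacherDistiller, feature_norm_entropy) and the MSE-plus-cosine matching loss are hypothetical choices for illustration, not the released Theia code; the official implementation is linked at https://theia.theaiinstitute.com.

```python
# Minimal sketch, NOT the official Theia implementation.
# Assumes: a student encoder returning per-token features, and teacher
# features precomputed offline from frozen off-the-shelf vision models.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureTranslator(nn.Module):
    """Projects student features into one teacher's feature space."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(student_dim),
            nn.Linear(student_dim, teacher_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class MultiTeacherDistiller(nn.Module):
    """Distills several frozen vision teachers into a single student encoder."""

    def __init__(self, student: nn.Module, student_dim: int,
                 teacher_dims: dict[str, int]):
        super().__init__()
        self.student = student
        # One translator head per teacher, all sharing the student backbone.
        self.translators = nn.ModuleDict({
            name: FeatureTranslator(student_dim, dim)
            for name, dim in teacher_dims.items()
        })

    def loss(self, images: torch.Tensor,
             teacher_feats: dict[str, torch.Tensor]) -> torch.Tensor:
        z = self.student(images)                      # (B, N, student_dim)
        total = torch.tensor(0.0, device=images.device)
        for name, target in teacher_feats.items():
            pred = self.translators[name](z)          # (B, N, teacher_dim)
            # Feature matching against the frozen teacher; MSE plus a cosine
            # term is a common (assumed) choice for feature distillation.
            total = total + F.mse_loss(pred, target) \
                + (1.0 - F.cosine_similarity(pred, target, dim=-1).mean())
        return total / len(teacher_feats)


def feature_norm_entropy(features: torch.Tensor, bins: int = 64) -> torch.Tensor:
    """Entropy of the per-token feature-norm distribution, used here as a
    rough proxy for the representation-quality measure described above."""
    norms = features.flatten(0, -2).norm(dim=-1)      # one norm per token
    hist = torch.histc(norms, bins=bins,
                       min=float(norms.min()), max=float(norms.max()))
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * p.log()).sum()
```

In this sketch the student can be any encoder that returns per-token features of width student_dim; the per-teacher translator heads let one shared representation be supervised by teachers with different feature dimensions.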

Jinghuan Shang, Karl Schmeckpeper, Brandon B. May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, Laura Herlant • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Semantic segmentation | ADE20K | mIoU | 35.55 | 1024
Semantic segmentation | Pascal Context | mIoU | 69.84 | 217
Semantic segmentation | NYUD v2 | mIoU | 38.9 | 125
Semantic segmentation | ScanNet | mIoU | 14.71 | 59
Depth Estimation | NYU V2 | RMSE | 0.6377 | 57
Semantic segmentation | Pascal Context | mIoU | 69.84 | 43
Semantic segmentation | SUN-RGBD | IoU | 11.18 | 37
Saliency Detection | Pascal Context | maxF Score | 80.63 | 28
Surface Normal Estimation | Pascal Context | Mean Error (MAE) | 16.94 | 28
Depth Estimation | NYUD | RMSE | 0.6377 | 25
(Showing 10 of 30 rows.)
