
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

About

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
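The masking strategy described above can be sketched on a grid of image patches. This is a minimal illustration, not the authors' implementation: the function names, the 16x16 patch grid, the number of target blocks, and all scale and aspect-ratio ranges are assumptions chosen to show "large target blocks" and a "spatially distributed context block". Patches covered by any target are removed from the context, so the targets have to be predicted rather than copied.

```python
import math
import random


def sample_block(grid, scale_range, aspect_range, rng):
    """Sample a rectangular block of (row, col) patch indices on a grid x grid layout."""
    scale = rng.uniform(*scale_range)      # fraction of the grid area to cover
    aspect = rng.uniform(*aspect_range)    # height / width ratio of the block
    area = scale * grid * grid
    # Derive integer height and width, clamped to the grid size.
    h = max(1, min(grid, round(math.sqrt(area * aspect))))
    w = max(1, min(grid, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid - h)
    left = rng.randint(0, grid - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}


def sample_masks(grid=16, num_targets=4, seed=0):
    """Return (context, targets): one context patch set and a list of target patch sets.

    All numeric ranges here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # (a) Target blocks: each covers a sizeable fraction of the image.
    targets = [
        sample_block(grid, (0.15, 0.2), (0.75, 1.5), rng)
        for _ in range(num_targets)
    ]
    # (b) Context block: one large, square block spanning most of the image.
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    # Remove target patches from the context so the prediction task is non-trivial.
    for target in targets:
        context -= target
    return context, targets


if __name__ == "__main__":
    context, targets = sample_masks()
    print("context patches:", len(context))
    print("target block sizes:", [len(t) for t in targets])
```

In a full training loop, a context encoder would see only the `context` patches, and a predictor would output representations for each entry in `targets`, to be compared against the outputs of a separate target encoder. Those components are omitted here.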

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas • 2023

Related benchmarks

Task                    Dataset               Result                 Rank
Semantic segmentation   ADE20K (val)          mIoU 46.9              3069
Image Classification    ImageNet-1K           Top-1 Acc 85.3         1239
Semantic segmentation   ADE20K                mIoU 47.6              1028
Image Classification    ImageNet 1k (test)    Top-1 Accuracy 82.7    880
Semantic segmentation   Cityscapes            mIoU 18.59             668
Image Classification    Stanford Cars         Accuracy 59.2          660
Image Classification    ImageNet-1K           Top-1 Acc 72.4         600
Image Classification    DTD                   Accuracy 69.89         599
Image Classification    Food-101              Accuracy 83.3          570
Semantic segmentation   ADE20K                mIoU 51.2              559
Showing 10 of 128 rows