
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

About

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas • 2023
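The multi-block masking strategy described in the abstract can be sketched in a few lines. This is an illustrative re-implementation, not the authors' code: the 14×14 patch grid, the block scale ranges, and the aspect-ratio ranges are assumed placeholders chosen to reflect the idea (large "semantic" target blocks, one spatially distributed context block with the targets removed).

```python
import random

def sample_block(grid_h, grid_w, scale_range, aspect_range, rng):
    """Sample a rectangular block of patch indices on a grid_h x grid_w grid.
    scale_range is the fraction of total patches; aspect_range is h/w."""
    scale = rng.uniform(*scale_range)
    aspect = rng.uniform(*aspect_range)
    area = scale * grid_h * grid_w
    h = max(1, min(grid_h, round((area * aspect) ** 0.5)))
    w = max(1, min(grid_w, round((area / aspect) ** 0.5)))
    top = rng.randrange(grid_h - h + 1)
    left = rng.randrange(grid_w - w + 1)
    return {r * grid_w + c
            for r in range(top, top + h)
            for c in range(left, left + w)}

def sample_masks(grid_h=14, grid_w=14, n_targets=4, rng=None):
    """I-JEPA-style masks: several large target blocks, plus one large
    context block with all target patches removed from it, so the
    predictor must infer target representations it cannot see."""
    rng = rng or random.Random(0)
    targets = [sample_block(grid_h, grid_w, (0.15, 0.2), (0.75, 1.5), rng)
               for _ in range(n_targets)]
    context = sample_block(grid_h, grid_w, (0.85, 1.0), (1.0, 1.0), rng)
    context -= set().union(*targets)  # enforce context/target disjointness
    return context, targets
```

During training, the context patches would be encoded, and a predictor would regress the (target-encoder) representations of each target block from the context representation; only the masking logic is sketched here.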

Related benchmarks

| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 46.9 | 2888 |
| Image classification | ImageNet-1K | Top-1 Acc | 85.3 | 1239 |
| Semantic segmentation | ADE20K | mIoU | 47.6 | 1024 |
| Image classification | ImageNet-1k (test) | Top-1 Acc | 82.7 | 848 |
| Semantic segmentation | Cityscapes | mIoU | 18.59 | 658 |
| Image classification | ImageNet-1K | Top-1 Acc | 72.4 | 600 |
| Image classification | Food-101 | Accuracy | 70.34 | 542 |
| Instance segmentation | COCO | AP (mask) | 44.5 | 291 |
| Image classification | iNaturalist 2018 | Top-1 Acc | 55.3 | 291 |
| Image classification | ImageNet-1k 1.0 (test) | Top-1 Acc | 84.9 | 229 |

Showing 10 of 92 rows.
