
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

About

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
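The multi-block masking strategy described above — several target blocks sampled at a sufficiently large, semantic scale, plus a large, spatially distributed context block with the target patches removed — can be sketched roughly as follows. The grid size, scale ranges, and helper names here are illustrative assumptions for a ViT-style 14×14 patch grid, not the paper's exact implementation:

```python
import numpy as np

def sample_block(rng, grid=14, scale=(0.15, 0.2), aspect=(0.75, 1.5)):
    """Sample one rectangular block of patch indices on a grid x grid token map.

    Hypothetical helper: scale is the block's area as a fraction of the image,
    aspect bounds its height/width ratio. Defaults are illustrative only.
    """
    area = rng.uniform(*scale) * grid * grid
    ratio = rng.uniform(*aspect)
    h = max(1, min(grid, int(round(np.sqrt(area / ratio)))))
    w = max(1, min(grid, int(round(np.sqrt(area * ratio)))))
    top = rng.integers(0, grid - h + 1)
    left = rng.integers(0, grid - w + 1)
    rows, cols = np.meshgrid(np.arange(top, top + h),
                             np.arange(left, left + w), indexing="ij")
    return set((rows * grid + cols).ravel().tolist())

def sample_masks(rng, grid=14, n_targets=4):
    """Return (context, targets): a few semantic-scale target blocks, and a
    large context block with all target patches removed so the predictor
    cannot simply copy target tokens from the context."""
    targets = [sample_block(rng, grid, scale=(0.15, 0.2))
               for _ in range(n_targets)]
    context = sample_block(rng, grid, scale=(0.85, 1.0), aspect=(1.0, 1.0))
    covered = set().union(*targets)
    return context - covered, targets

rng = np.random.default_rng(0)
context, targets = sample_masks(rng)
```

In training, the context patches would be encoded by the context encoder, and a predictor would regress the (target-encoder) representations of each target block from them — prediction happens in representation space, which is what makes the approach non-generative.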

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas · 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 46.9 | 2731 |
| Semantic segmentation | ADE20K | mIoU | 47.6 | 936 |
| Image Classification | ImageNet-1K | Top-1 Acc | 85.3 | 836 |
| Semantic segmentation | Cityscapes | mIoU | 18.59 | 578 |
| Image Classification | Food-101 | Accuracy | 70.34 | 494 |
| Image Classification | iNaturalist 2018 | Top-1 Accuracy | 55.3 | 287 |
| Instance Segmentation | COCO | AP (mask) | 44.5 | 279 |
| Image Classification | ImageNet-1k 1.0 (test) | Top-1 Accuracy | 84.9 | 197 |
| Image Classification | iNaturalist 2018 (test) | Top-1 Accuracy | 75.9 | 192 |
| Image Classification | ImageNet-1K | Accuracy | 84.9 | 190 |

Showing 10 of 57 rows.
