Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
About
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
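The core recipe above (sample a few large target blocks, sample one spatially distributed context block with the target patches removed, then predict each target's representation from the context) can be illustrated with a toy sketch. This is a minimal NumPy mock-up under assumed toy sizes, with random linear maps standing in for the context encoder, target encoder, and predictor; it is not the paper's actual ViT-based implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

GRID = 16  # patches per side (illustrative; not the paper's exact config)
DIM = 32   # toy embedding dimension

def sample_block(grid, scale_range, rng):
    """Sample a square block of patches; returns a boolean mask over the grid."""
    scale = rng.uniform(*scale_range)
    side = max(1, int(round(np.sqrt(scale * grid * grid))))
    top = rng.integers(0, grid - side + 1)
    left = rng.integers(0, grid - side + 1)
    mask = np.zeros((grid, grid), dtype=bool)
    mask[top:top + side, left:left + side] = True
    return mask

# (a) Sample several target blocks at sufficiently large (semantic) scale...
targets = [sample_block(GRID, (0.15, 0.2), rng) for _ in range(4)]
# (b) ...and one large, spatially distributed context block,
# removing target patches so the context never leaks the targets.
context = sample_block(GRID, (0.85, 1.0), rng)
for t in targets:
    context &= ~t

# Toy "encoder": a fixed random embedding per patch position.
patch_embed = rng.normal(size=(GRID * GRID, DIM))

def encode(mask):
    """Return the embeddings of the patches selected by `mask`."""
    return patch_embed[np.flatnonzero(mask)]

# Toy "predictor": a linear map on the mean-pooled context representation.
W = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)
ctx_repr = encode(context).mean(axis=0)

# Average L2 loss between predicted and target representations.
loss = 0.0
for t in targets:
    pred = ctx_repr @ W              # predicted target representation (pooled)
    tgt = encode(t).mean(axis=0)     # "ground truth" target representation
    loss += np.mean((pred - tgt) ** 2)
loss /= len(targets)
print(f"toy I-JEPA loss: {loss:.4f}")
```

Note that the loss lives in representation space rather than pixel space, which is what makes the approach non-generative: no decoder ever reconstructs the image.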
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K (val) | mIoU | 46.9 | 2731 |
| Semantic Segmentation | ADE20K | mIoU | 47.6 | 936 |
| Image Classification | ImageNet-1K | Top-1 Accuracy | 85.3 | 836 |
| Semantic Segmentation | Cityscapes | mIoU | 18.59 | 578 |
| Image Classification | Food-101 | Accuracy | 70.34 | 494 |
| Image Classification | iNaturalist 2018 | Top-1 Accuracy | 55.3 | 287 |
| Instance Segmentation | COCO | mask AP | 44.5 | 279 |
| Image Classification | ImageNet-1k 1.0 (test) | Top-1 Accuracy | 84.9 | 197 |
| Image Classification | iNaturalist 2018 (test) | Top-1 Accuracy | 75.9 | 192 |
| Image Classification | ImageNet-1K | Accuracy | 84.9 | 190 |