Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
About
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
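The core recipe above (sample a few large target blocks, sample one spatially distributed context block with the target patches removed, then predict each target's representation from the context) can be illustrated with a toy sketch. This is a minimal NumPy mock-up under assumed toy sizes, with random linear maps standing in for the context encoder, target encoder, and predictor; it is not the paper's actual ViT-based implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

GRID = 16  # patches per side (illustrative; not the paper's exact config)
DIM = 32   # toy embedding dimension

def sample_block(grid, scale_range, rng):
    """Sample a square block of patches; returns a boolean mask over the grid."""
    scale = rng.uniform(*scale_range)
    side = max(1, int(round(np.sqrt(scale * grid * grid))))
    top = rng.integers(0, grid - side + 1)
    left = rng.integers(0, grid - side + 1)
    mask = np.zeros((grid, grid), dtype=bool)
    mask[top:top + side, left:left + side] = True
    return mask

# (a) Sample several target blocks at sufficiently large (semantic) scale...
targets = [sample_block(GRID, (0.15, 0.2), rng) for _ in range(4)]
# (b) ...and one large, spatially distributed context block,
# removing target patches so the context never leaks the targets.
context = sample_block(GRID, (0.85, 1.0), rng)
for t in targets:
    context &= ~t

# Toy "encoder": a fixed random embedding per patch position.
patch_embed = rng.normal(size=(GRID * GRID, DIM))

def encode(mask):
    """Return the embeddings of the patches selected by `mask`."""
    return patch_embed[np.flatnonzero(mask)]

# Toy "predictor": a linear map on the mean-pooled context representation.
W = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)
ctx_repr = encode(context).mean(axis=0)

# Average L2 loss between predicted and target representations.
loss = 0.0
for t in targets:
    pred = ctx_repr @ W              # predicted target representation (pooled)
    tgt = encode(t).mean(axis=0)     # "ground truth" target representation
    loss += np.mean((pred - tgt) ** 2)
loss /= len(targets)
print(f"toy I-JEPA loss: {loss:.4f}")
```

Note that the loss lives in representation space rather than pixel space, which is what makes the approach non-generative: no decoder ever reconstructs the image.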
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K (val) | mIoU | 46.9 | 2731 |
| Semantic Segmentation | ADE20K | mIoU | 47.6 | 936 |
| Image Classification | ImageNet-1K | Top-1 Accuracy | 85.3 | 836 |
| Semantic Segmentation | Cityscapes | mIoU | 18.59 | 578 |
| Image Classification | Food-101 | Accuracy | 70.34 | 494 |
| Image Classification | iNaturalist 2018 | Top-1 Accuracy | 55.3 | 287 |
| Instance Segmentation | COCO | mask AP | 44.5 | 279 |
| Image Classification | ImageNet-1k 1.0 (test) | Top-1 Accuracy | 84.9 | 197 |
| Image Classification | iNaturalist 2018 (test) | Top-1 Accuracy | 75.9 | 192 |
| Image Classification | ImageNet-1K | Accuracy | 84.9 | 190 |