Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning
About
The Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for unsupervised visual representation learning, extracting visual features from unlabeled imagery through a masking strategy. Despite the success of the image-based variant (I-JEPA), two primary limitations have been identified: the Exponential Moving Average (EMA) used in I-JEPA does not reliably prevent complete collapse, and the I-JEPA prediction objective does not accurately learn the mean of patch representations. To address these challenges, this study introduces C-JEPA (Contrastive-JEPA), a framework that integrates I-JEPA with the Variance-Invariance-Covariance Regularization (VICReg) strategy. The variance and covariance terms are intended to prevent complete collapse, while the invariance term keeps the mean of augmented views consistent, thereby addressing both limitations. Through empirical and theoretical evaluations, our work demonstrates that C-JEPA significantly enhances the stability and quality of visual representation learning. When pre-trained on the ImageNet-1K dataset, C-JEPA converges faster and reaches better performance in both linear probing and fine-tuning.
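To make the three VICReg terms concrete, here is a minimal NumPy sketch of a VICReg-style regularizer applied to two batches of embeddings. It is an illustration under stated assumptions, not the C-JEPA implementation: the function names `vicreg_terms` and `vicreg_loss` are hypothetical, the weights 25/25/1 are the defaults from the VICReg paper rather than values given in this abstract, and how C-JEPA combines these terms with the I-JEPA prediction loss is not specified here.

```python
import numpy as np


def vicreg_terms(z_a, z_b, gamma=1.0, eps=1e-4):
    """Return (invariance, variance, covariance) for two (n, d) embedding batches.

    Hypothetical helper for illustration; gamma and eps follow common VICReg usage.
    """
    n, d = z_a.shape

    # Invariance: mean squared error between the two views' embeddings.
    invariance = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge loss that pushes each dimension's std above gamma,
        # which penalizes embeddings that collapse to a constant.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def covariance_term(z):
        # Penalize squared off-diagonal covariance to decorrelate dimensions.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_a) + variance_term(z_b)
    covariance = covariance_term(z_a) + covariance_term(z_b)
    return invariance, variance, covariance


def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0):
    # Weights are assumed (VICReg paper defaults), not taken from C-JEPA.
    inv, var, cov = vicreg_terms(z_a, z_b)
    return sim_w * inv + var_w * var + cov_w * cov
```

In this sketch, a fully collapsed batch (every row identical) has zero invariance and covariance penalties but a large variance penalty, which is the mechanism the abstract relies on to discourage collapse.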
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 38.68 | 2888 |
| Image Classification | ImageNet-1K | Top-1 Acc | 86.2 | 1239 |
| Semantic segmentation | ADE20K | mIoU | 48.7 | 1024 |
| Instance Segmentation | COCO | APmask | 45.3 | 291 |
| Object Detection | COCO | AP (Box) | 50.7 | 144 |
| Video Object Segmentation | DAVIS | -- | -- | 66 |
| Image Classification | ImageNet | Top-1 Acc | 86.9 | 65 |
| Instance Segmentation | MS-COCO | mAP Mask | 45.3 | 60 |
| Fine-grained Image Classification | CUB | Top-1 Acc | 65.8 | 35 |
| Instance Segmentation | MS-COCO (val) | -- | -- | 16 |