
Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

About

In recent advances in unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations have been identified: the Exponential Moving Average (EMA) update in I-JEPA does not reliably prevent entire collapse, and the I-JEPA predictor fails to accurately learn the mean of the patch representations. To address these challenges, this study introduces a novel framework, C-JEPA (Contrastive-JEPA), which integrates the Image-based Joint-Embedding Predictive Architecture with the Variance-Invariance-Covariance Regularization (VICReg) strategy. This integration learns the variance/covariance terms that prevent entire collapse and enforces invariance across the means of augmented views, thereby overcoming the identified limitations. Through empirical and theoretical evaluations, our work demonstrates that C-JEPA significantly enhances the stability and quality of visual representation learning. When pre-trained on the ImageNet-1K dataset, C-JEPA exhibits rapid and improved convergence in both linear probing and fine-tuning performance metrics.
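To make the abstract's collapse-prevention argument concrete, here is a minimal NumPy sketch of VICReg-style regularization terms of the kind C-JEPA combines with the JEPA objective. The function name, weights, and hinge threshold `gamma` are illustrative assumptions, not the authors' implementation; `z_a` and `z_b` stand in for embeddings of two augmented views.

```python
import numpy as np

def vicreg_terms(z_a, z_b, gamma=1.0, eps=1e-4):
    """Illustrative VICReg-style regularizers on two embedding batches (N, D).

    Hypothetical sketch: C-JEPA applies terms like these to JEPA embeddings;
    the exact weighting and placement follow the paper, not this code.
    """
    n, d = z_a.shape

    # Invariance: mean-squared distance between the two views' embeddings,
    # pulling the augmented views toward the same mean representation.
    invariance = np.mean((z_a - z_b) ** 2)

    # Variance: hinge keeping each dimension's std above gamma;
    # this is the term that prevents entire (constant) collapse.
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))
    variance = var_term(z_a) + var_term(z_b)

    # Covariance: penalize off-diagonal covariance entries so that
    # embedding dimensions decorrelate instead of collapsing together.
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d
    covariance = cov_term(z_a) + cov_term(z_b)

    return invariance, variance, covariance
```

A fully collapsed encoder (all embeddings constant) maximizes the variance hinge, so gradient descent on a weighted sum of these terms pushes the representation away from that degenerate solution.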

Shentong Mo, Shengbang Tong • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 38.68 | 2888 |
| Image Classification | ImageNet-1K | Top-1 Acc 86.2 | 1239 |
| Semantic segmentation | ADE20K | mIoU 48.7 | 1024 |
| Instance Segmentation | COCO | AP mask 45.3 | 291 |
| Object Detection | COCO | AP (Box) 50.7 | 144 |
| Video Object Segmentation | DAVIS | -- | 66 |
| Image Classification | ImageNet | Top-1 Acc 86.9 | 65 |
| Instance Segmentation | MS-COCO | mAP Mask 45.3 | 60 |
| Fine-grained Image Classification | CUB | Top-1 Acc 65.8 | 35 |
| Instance Segmentation | MS-COCO (val) | -- | 16 |

Showing 10 of 17 rows.
