Non-Contrastive Learning Meets Language-Image Pre-Training
About
Contrastive language-image pre-training (CLIP) serves as a de facto standard for aligning images and texts. Nonetheless, the loose correlation between images and texts in web-crawled data renders the contrastive objective data-inefficient and reliant on a large training batch size. In this work, we explore the validity of non-contrastive language-image pre-training (nCLIP) and study whether the nice properties exhibited by visual self-supervised models can emerge. We empirically observe that the non-contrastive objective benefits representation learning but substantially underperforms in zero-shot recognition. Based on this study, we further introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics. The synergy between the two objectives lets xCLIP enjoy the best of both worlds: superior performance in both zero-shot transfer and representation learning. Systematic evaluation is conducted spanning a wide variety of downstream tasks, including zero-shot classification, out-of-domain classification, retrieval, visual representation learning, and textual representation learning, showcasing a consistent performance gain and validating the effectiveness of xCLIP.
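The multi-tasking idea can be illustrated with a minimal sketch: a standard symmetric InfoNCE (CLIP) loss combined with a non-contrastive term that matches the two modalities' projected distributions via cross-entropy. This is a toy NumPy illustration of the general recipe, not the paper's exact formulation; the projection heads, cluster dimension, and weighting coefficient `lam` are hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized image/text embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    idx = np.arange(len(img))                   # matched pairs on the diagonal
    p_i2t = softmax(logits, axis=1)
    p_t2i = softmax(logits.T, axis=1)
    return -0.5 * (np.log(p_i2t[idx, idx]).mean()
                   + np.log(p_t2i[idx, idx]).mean())

def nclip_loss(img_proj, txt_proj):
    """Non-contrastive term: symmetric cross-entropy between the two
    modalities' softmaxed projections onto a shared latent space.
    Only matched pairs interact -- no negatives are used."""
    p = softmax(img_proj)
    q = softmax(txt_proj)
    eps = 1e-9  # numerical guard for log
    return -0.5 * ((q * np.log(p + eps)).sum(1).mean()
                   + (p * np.log(q + eps)).sum(1).mean())

def xclip_loss(img, txt, img_proj, txt_proj, lam=1.0):
    """Multi-task objective: contrastive + weighted non-contrastive term.
    `lam` is an assumed balancing hyperparameter."""
    return clip_loss(img, txt) + lam * nclip_loss(img_proj, txt_proj)
```

In practice the two terms would be computed from separate projection heads on top of the same image and text encoders, so the contrastive term shapes instance discrimination while the non-contrastive term shapes semantic structure.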
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Accuracy | 74.1 | 524 |
| Image Classification | EuroSAT | Accuracy | 40 | 497 |
| Image Classification | Food-101 | -- | -- | 494 |
| Image Classification | Stanford Cars | -- | -- | 477 |
| Text-to-Image Retrieval | Flickr30K | R@1 | 57.3 | 460 |
| Image Classification | ImageNet | -- | -- | 429 |
| Image Classification | SUN397 | Accuracy | 59.9 | 425 |
| Image Classification | MNIST | -- | -- | 395 |
| Image Classification | CIFAR100 | Accuracy | 54.5 | 331 |
| Classification | Cars | Accuracy | 18 | 314 |