Non-Contrastive Learning Meets Language-Image Pre-Training
About
Contrastive language-image pre-training (CLIP) serves as a de facto standard for aligning images and texts. Nonetheless, the loose correlation between images and texts in web-crawled data renders the contrastive objective data-inefficient and reliant on a large training batch size. In this work, we explore the validity of non-contrastive language-image pre-training (nCLIP) and study whether the nice properties exhibited by visual self-supervised models can emerge. We empirically observe that the non-contrastive objective benefits representation learning but substantially underperforms in zero-shot recognition. Based on this study, we further introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics. The synergy between the two objectives lets xCLIP enjoy the best of both worlds: superior performance in both zero-shot transfer and representation learning. Systematic evaluation is conducted spanning a wide variety of downstream tasks, including zero-shot classification, out-of-domain classification, retrieval, visual representation learning, and textual representation learning, showcasing a consistent performance gain and validating the effectiveness of xCLIP.
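The multi-tasking idea can be illustrated with a minimal sketch: a standard symmetric InfoNCE (CLIP) loss combined with a non-contrastive term that matches the two modalities' projected distributions via cross-entropy. This is a toy NumPy illustration of the general recipe, not the paper's exact formulation; the projection heads, cluster dimension, and weighting coefficient `lam` are hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized image/text embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    idx = np.arange(len(img))                   # matched pairs on the diagonal
    p_i2t = softmax(logits, axis=1)
    p_t2i = softmax(logits.T, axis=1)
    return -0.5 * (np.log(p_i2t[idx, idx]).mean()
                   + np.log(p_t2i[idx, idx]).mean())

def nclip_loss(img_proj, txt_proj):
    """Non-contrastive term: symmetric cross-entropy between the two
    modalities' softmaxed projections onto a shared latent space.
    Only matched pairs interact -- no negatives are used."""
    p = softmax(img_proj)
    q = softmax(txt_proj)
    eps = 1e-9  # numerical guard for log
    return -0.5 * ((q * np.log(p + eps)).sum(1).mean()
                   + (p * np.log(q + eps)).sum(1).mean())

def xclip_loss(img, txt, img_proj, txt_proj, lam=1.0):
    """Multi-task objective: contrastive + weighted non-contrastive term.
    `lam` is an assumed balancing hyperparameter."""
    return clip_loss(img, txt) + lam * nclip_loss(img_proj, txt_proj)
```

In practice the two terms would be computed from separate projection heads on top of the same image and text encoders, so the contrastive term shapes instance discrimination while the non-contrastive term shapes semantic structure.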
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Accuracy | 74.1 | 524 |
| Image Classification | EuroSAT | Accuracy | 40 | 497 |
| Image Classification | Food-101 | -- | -- | 494 |
| Image Classification | Stanford Cars | -- | -- | 477 |
| Text-to-Image Retrieval | Flickr30K | R@1 | 57.3 | 460 |
| Image Classification | ImageNet | -- | -- | 429 |
| Image Classification | SUN397 | Accuracy | 59.9 | 425 |
| Image Classification | MNIST | -- | -- | 395 |
| Image Classification | CIFAR100 | Accuracy | 54.5 | 331 |
| Classification | Cars | Accuracy | 18 | 314 |