RankCLIP: Ranking-Consistent Language-Image Pretraining

About

Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependence on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To address this, we introduce RankCLIP, a novel pre-training method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to a list-wise loss, and by leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RankCLIP in various downstream tasks, notably achieving significant gains in zero-shot classification over state-of-the-art methods, underscoring the importance of this enhanced learning process.

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun • 2024
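
The abstract contrasts a pair-wise loss with a list-wise one that enforces ranking consistency. The sketch below illustrates that contrast in plain Python under stated assumptions; it is not the authors' implementation. The Plackett-Luce likelihood is used here as one common list-wise ranking loss, and the function names, the temperature of 0.07, the weight `lam`, and the choice of which similarity matrix serves as the reference ranking are all hypothetical choices made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def similarity_matrix(xs, ys):
    # Cosine similarities, assuming xs and ys are already normalized.
    return [[dot(x, y) for y in ys] for x in xs]

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def pairwise_clip_loss(sim, temperature=0.07):
    # Symmetric InfoNCE: only the matched (diagonal) pair is a positive.
    n = len(sim)
    loss = 0.0
    for i in range(n):
        row = [sim[i][j] / temperature for j in range(n)]
        col = [sim[j][i] / temperature for j in range(n)]
        loss += (log_sum_exp(row) - row[i]) + (log_sum_exp(col) - col[i])
    return loss / (2 * n)

def plackett_luce_nll(scores, reference):
    # Negative log-likelihood, under a Plackett-Luce model with `scores`,
    # of the ordering obtained by sorting `reference` in descending order.
    order = sorted(range(len(scores)), key=lambda j: -reference[j])
    ranked = [scores[j] for j in order]
    return sum(log_sum_exp(ranked[k:]) - ranked[k] for k in range(len(ranked)))

def ranking_consistency_loss(score_sim, reference_sim, temperature=0.07):
    # List-wise term: each row of score_sim should rank the batch in the
    # same order as the matching row of reference_sim.
    n = len(score_sim)
    total = 0.0
    for i in range(n):
        scores = [s / temperature for s in score_sim[i]]
        total += plackett_luce_nll(scores, reference_sim[i])
    return total / n

def total_loss(image_emb, text_emb, lam=0.1):
    # Assumed combination: pair-wise loss plus cross-modal and in-modal
    # ranking-consistency terms, weighted by the hypothetical factor lam.
    img_txt = similarity_matrix(image_emb, text_emb)
    txt_img = similarity_matrix(text_emb, image_emb)
    img_img = similarity_matrix(image_emb, image_emb)
    txt_txt = similarity_matrix(text_emb, text_emb)
    cross = ranking_consistency_loss(img_txt, txt_img)
    in_modal = ranking_consistency_loss(img_img, txt_txt)
    return pairwise_clip_loss(img_txt) + lam * (cross + in_modal)

# Toy batch of three image/text embedding pairs (made-up numbers).
images = [normalize(v) for v in ([1.0, 0.1], [0.8, 0.6], [0.0, 1.0])]
texts = [normalize(v) for v in ([0.9, 0.2], [0.7, 0.7], [0.1, 1.0])]
print(total_loss(images, texts))
```

The pair-wise term only asks that each matched pair outscore the rest of the batch, while the list-wise terms penalize any disagreement in the full ordering of the batch, which is how secondary, many-to-many similarities can enter the objective.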

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-10	--	973
Image Classification	DTD	--	610
Image Classification	FGVC-Aircraft (test)	--	322
Image Classification	Stanford Cars (test)	--	320
Image Classification	CUB-200-2011 (test)	Top-1 Acc 36.1	316
Image Classification	Oxford Flowers-102 (test)	Top-1 Accuracy 73	221
Image Classification	SVHN	Top-1 Accuracy 47.7	209
Image Classification	Food	--	152
Image Classification	GTSRB	Top-1 Accuracy 60.6	115
Image Classification	STL10	Accuracy 79.6	108

