RankCLIP: Ranking-Consistent Language-Image Pretraining

About

Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RankCLIP, a novel pre-training method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to a list-wise loss, and by leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RankCLIP in various downstream tasks, notably achieving significant gains in zero-shot classification over state-of-the-art methods, underscoring the importance of this enhanced learning process.
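
To make the pair-wise versus list-wise distinction concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact loss, so the list-wise term below uses a Plackett-Luce (ListMLE-style) likelihood as one common choice. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def similarity_matrix(rows, cols):
    """Cosine similarity between every vector in rows and every vector in cols."""
    rows = [normalize(r) for r in rows]
    cols = [normalize(c) for c in cols]
    return [[dot(r, c) for c in cols] for r in rows]

def log_sum_exp(xs):
    m = max(xs)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in xs))

def pairwise_clip_loss(sim, temperature=0.07):
    """CLIP-style symmetric cross-entropy: only the diagonal pair (i, i) is a match."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    image_to_text = sum(log_sum_exp(logits[i]) - logits[i][i] for i in range(n))
    text_to_image = sum(log_sum_exp(columns[j]) - columns[j][j] for j in range(n))
    return (image_to_text + text_to_image) / (2 * n)

def listwise_ranking_loss(pred_scores, ref_scores):
    """Plackett-Luce negative log-likelihood (assumed form, not taken from the paper).

    The ordering induced by ref_scores is treated as the target ranking;
    the loss is small when pred_scores rank the items in the same order.
    """
    order = sorted(range(len(ref_scores)), key=lambda k: ref_scores[k], reverse=True)
    ranked = [pred_scores[k] for k in order]
    return sum(log_sum_exp(ranked[k:]) - ranked[k] for k in range(len(ranked)))

def ranking_consistency(pred_sim, ref_sim):
    """Average list-wise loss over rows: each row of pred_sim should rank like ref_sim."""
    n = len(pred_sim)
    return sum(listwise_ranking_loss(pred_sim[i], ref_sim[i]) for i in range(n)) / n

# Toy batch of three image/text embedding pairs (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
texts = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]

cross_modal = similarity_matrix(images, texts)  # image-to-text similarities
in_modal = similarity_matrix(texts, texts)      # text-to-text similarities

# Pair-wise term plus a ranking term that asks cross-modal rows to rank
# candidates the same way the in-modal (text-text) rows do.
total_loss = pairwise_clip_loss(cross_modal) + ranking_consistency(cross_modal, in_modal)
```

The pair-wise term only rewards the matched pair on the diagonal, while the ranking term uses the full ordering of every row, which is how a list-wise objective can reflect many-to-many structure. How RankCLIP actually weights and combines its in-modal and cross-modal terms is not stated in the abstract, so the unweighted sum here is a placeholder.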

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun • 2024

Related benchmarks

Task                  Dataset                    Result               Rank
Image Classification  CIFAR-10                   --                   875
Image Classification  DTD                        --                   599
Image Classification  FGVC-Aircraft (test)       --                   322
Image Classification  Stanford Cars (test)       --                   320
Image Classification  CUB-200-2011 (test)        Top-1 Acc 36.1       303
Image Classification  Oxford Flowers-102 (test)  Top-1 Accuracy 73    200
Image Classification  SVHN                       Top-1 Accuracy 47.7  186
Image Classification  Food                       --                   152
Image Classification  GTSRB                      Top-1 Accuracy 60.6  115
Image Classification  STL10                      Accuracy 79.6        103
Showing 10 of 21 rows
