RankCLIP: Ranking-Consistent Language-Image Pretraining
About
Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their reliance on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To address this limitation, we introduce RankCLIP, a novel pre-training method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to a list-wise loss, and by leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling the model to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RankCLIP in various downstream tasks, notably achieving significant gains in zero-shot classification over state-of-the-art methods, underscoring the importance of this enhanced learning process.
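To make the idea of a list-wise ranking-consistency term more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a Plackett-Luce (ListMLE-style) likelihood as the list-wise loss, takes "cross-modal consistency" to mean that image-to-text and text-to-image similarity rankings should agree, and takes "in-modal consistency" to mean that image-image and text-text similarity rankings should agree. The function names `listmle` and `ranking_consistency_loss` are hypothetical, and the weighting against the standard CLIP contrastive loss is left out.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def listmle(scores, reference):
    """Plackett-Luce negative log-likelihood, averaged over rows.

    For each row, the items are ordered by descending `reference` value, and
    the loss measures how unlikely that ordering is under `scores`.
    """
    order = np.argsort(-reference, axis=1)           # reference ranking per row
    s = np.take_along_axis(scores, order, axis=1)    # scores in reference order
    m = s.max(axis=1, keepdims=True)                 # shift for numerical stability
    # tail_sums[:, k] = sum over positions j >= k of exp(s_j - m)
    tail_sums = np.cumsum(np.exp(s - m)[:, ::-1], axis=1)[:, ::-1]
    log_denoms = np.log(tail_sums) + m
    return float(np.mean(np.sum(log_denoms - s, axis=1)))


def ranking_consistency_loss(image_emb, text_emb):
    """Assumed combination of a cross-modal and an in-modal list-wise term."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    cross = img @ txt.T       # row i: similarity of image i to every text
    in_image = img @ img.T    # image-image similarities
    in_text = txt @ txt.T     # text-text similarities
    # Cross-modal: image-to-text scores should follow the text-to-image ranking.
    cross_modal = listmle(cross, cross.T)
    # In-modal: image-image scores should follow the text-text ranking.
    in_modal = listmle(in_image, in_text)
    return cross_modal + in_modal


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))   # toy batch: 4 image embeddings
    texts = rng.normal(size=(4, 8))    # toy batch: 4 text embeddings
    print(ranking_consistency_loss(images, texts))
```

In an actual training setup this term would be written in an autodiff framework and added to the usual pair-wise contrastive loss; whether the reference ranking is treated as a constant target is a design choice this sketch does not address.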
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | CIFAR-10 | -- | 875 |
| Image Classification | DTD | -- | 599 |
| Image Classification | FGVC-Aircraft (test) | -- | 322 |
| Image Classification | Stanford Cars (test) | -- | 320 |
| Image Classification | CUB-200-2011 (test) | Top-1 Accuracy: 36.1 | 303 |
| Image Classification | Oxford Flowers-102 (test) | Top-1 Accuracy: 73 | 200 |
| Image Classification | SVHN | Top-1 Accuracy: 47.7 | 186 |
| Image Classification | Food | -- | 152 |
| Image Classification | GTSRB | Top-1 Accuracy: 60.6 | 115 |
| Image Classification | STL10 | Accuracy: 79.6 | 103 |