Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

About

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, thereby restricting its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently. Instead of using the single image-text contrastive supervision, we fully exploit data potential through the use of (1) self-supervision within each modality; (2) multi-view supervision across modalities; (3) nearest-neighbor supervision from other similar pairs. Benefiting from intrinsic supervision, our DeCLIP-ResNet50 can achieve 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above the CLIP-ResNet50 while using 7.1× less data. Our DeCLIP-ResNet50 outperforms its counterpart in 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and compute also works well in our framework. Our code, dataset and models are released at: https://github.com/Sense-GVT/DeCLIP

Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, Junjie Yan • 2021
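
For readers unfamiliar with the baseline that the abstract contrasts against, the "single image-text contrastive supervision" used by CLIP-style training can be sketched roughly as below. This is a minimal pure-Python illustration, not the DeCLIP authors' code: the function names, the toy embeddings, and the fixed temperature of 0.07 are assumptions made for the example, and the three additional supervision signals listed in the abstract are not implemented here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE-style) loss.

    image_embs[i] and text_embs[i] are assumed to be a matching pair;
    every other pairing in the batch is treated as a negative.
    The temperature value is an assumption for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: row i should assign high probability to column i
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch of two pairs (hypothetical 2-D embeddings)
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss close to zero, while swapping the two captions gives a much larger loss, which is the signal the contrastive objective trains on.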

Related benchmarks

Task | Dataset | Metric | Result | Rank
Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 44.4 | 844
Image Classification | CIFAR-100 | Accuracy | 71.2 | 691
Image Classification | Stanford Cars | Accuracy | 81.7 | 635
Image Classification | DTD | Accuracy | 76.8 | 542
Image Classification | Food-101 | Accuracy | 82.7 | 542
Image Classification | CIFAR-10 | Accuracy | 89.8 | 508
Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 46.3 | 445
Image Classification | SUN397 | Accuracy | 72.8 | 425
Classification | Cars | Accuracy | 3.8 | 395
Image-to-Text Retrieval | Flickr30k (test) | R@1 | 60.4 | 392
Showing 10 of 39 rows
