
CLIP-KD: An Empirical Study of CLIP Model Distillation

About

Contrastive Language-Image Pre-training (CLIP) has become a promising language-supervised visual pre-training framework. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation, feature, gradient and contrastive paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We show that simple feature mimicry with a Mean Squared Error loss works surprisingly well. Moreover, interactive contrastive learning across teacher and student encoders also improves performance effectively. We explain the success of CLIP-KD by its maximizing the feature similarity between teacher and student. The unified method is applied to distill several student models trained on CC3M+12M. CLIP-KD improves student CLIP models consistently over zero-shot ImageNet classification and cross-modal retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the teacher, CLIP-KD achieves 57.5% and 55.4% zero-shot top-1 ImageNet accuracy with ViT-B/16 and ResNet-50, surpassing the original CLIP without KD by margins of 20.5% and 20.1%, respectively. Our code is released at https://github.com/winycg/CLIP-KD.
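The two losses the abstract highlights can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' released code: the class and function names, the projection layer, the embedding widths, and the temperature value are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMimicryLoss(nn.Module):
    """Hypothetical feature-mimicry KD loss: project the student's
    embeddings to the teacher's width and pull them toward the frozen
    teacher embeddings with Mean Squared Error."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # A linear projection bridges the dimension gap between encoders.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feat: torch.Tensor,
                teacher_feat: torch.Tensor) -> torch.Tensor:
        # Teacher features are fixed targets (no gradient flows to them).
        return F.mse_loss(self.proj(student_feat), teacher_feat.detach())


def interactive_contrastive_loss(student_img: torch.Tensor,
                                 teacher_txt: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical interactive contrastive loss: contrast one modality's
    student embeddings against the other modality's teacher embeddings
    (assumed already projected to a shared width)."""
    s = F.normalize(student_img, dim=-1)
    t = F.normalize(teacher_txt, dim=-1)
    logits = s @ t.t() / temperature
    # Matched image-text pairs sit on the diagonal of the logit matrix.
    targets = torch.arange(s.size(0))
    return F.cross_entropy(logits, targets)


# Toy usage with random tensors standing in for encoder outputs.
mimic = FeatureMimicryLoss(student_dim=512, teacher_dim=768)
s_feat = torch.randn(8, 512)   # student batch embeddings
t_feat = torch.randn(8, 768)   # teacher batch embeddings
mimic_loss = mimic(s_feat, t_feat)
icl_loss = interactive_contrastive_loss(torch.randn(8, 768), t_feat)
```

In practice the two terms would be added, possibly weighted, to the student's own CLIP contrastive objective; the weighting scheme here is left open since the abstract does not specify it.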

Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, Yongjun Xu • 2023

Related benchmarks

Task                           | Dataset                 | Result                    | Rank
Image Classification           | ImageNet-1k (val)       | -                         | 1453
Text-to-Image Retrieval        | Flickr30K               | R@1: 61.4                 | 460
Image-to-Text Retrieval        | Flickr30K               | R@1: 81.7                 | 379
Zero-shot Image Classification | ImageNet-1k (val)       | Accuracy: 42.6            | 28
Classification                 | ImageNet shift          | Accuracy: 41.6            | 22
Image-Text Retrieval           | COCO                    | Retrieval Score: 43.5     | 21
Classification                 | Scene-Centric Datasets  | Accuracy: 50              | 21
Zero-shot Evaluation           | StableEval (27 evals)   | Average Performance: 52.2 | 21
Classification                 | Object-Centric Datasets | Accuracy: 61.8            | 21
