TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

About

In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking explores the interaction between modalities during distillation, enabling student models to mimic the teacher's behavior of learning cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transfers the pre-trained weights from the teacher models to their student counterparts to improve distillation efficiency. Moreover, we extend the method into a multi-stage progressive distillation to mitigate the loss of informative weights during extreme compression. Comprehensive experiments demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of the pre-trained CLIP ViT-B/32 by 50% while maintaining comparable zero-shot performance. At comparable performance, distillation with weight inheritance speeds up training by 1.4-7.8$\times$ compared to training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves a zero-shot top-1 accuracy of 41.1% on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while using only 8.9% of the parameters. Finally, we demonstrate the good transferability of TinyCLIP in various downstream tasks. Code and models will be open-sourced at https://aka.ms/tinyclip.
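As a rough illustration of the affinity mimicking idea described above, the sketch below computes softmax-normalized image-to-text and text-to-image affinity matrices for a teacher and a student, then penalizes the student with a cross-entropy against the teacher's affinities. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.07, the use of cross-entropy, and the equal weighting of the two directions are all assumptions made for illustration.

```python
import math

def softmax(row, temperature=1.0):
    """Numerically stable softmax over one row of similarity scores."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def affinity(query_embs, key_embs, temperature):
    """Row i is the softmax over cosine similarities of query i to every key."""
    queries = [l2_normalize(v) for v in query_embs]
    keys = [l2_normalize(v) for v in key_embs]
    return [
        softmax([sum(a * b for a, b in zip(q, k)) for k in keys], temperature)
        for q in queries
    ]

def cross_entropy(target, predicted, eps=1e-12):
    """Mean row-wise cross-entropy between two row-stochastic matrices."""
    rows = [
        -sum(t * math.log(p + eps) for t, p in zip(t_row, p_row))
        for t_row, p_row in zip(target, predicted)
    ]
    return sum(rows) / len(rows)

def affinity_mimicking_loss(s_img, s_txt, t_img, t_txt, temperature=0.07):
    """Hypothetical distillation loss: student affinities mimic the teacher's.

    Affinities are computed within each model, so the student and teacher
    embeddings may have different dimensions. The temperature and the equal
    weighting of both directions are assumptions of this sketch.
    """
    # Image-to-text direction: each image attends over all texts in the batch.
    loss_i2t = cross_entropy(
        affinity(t_img, t_txt, temperature), affinity(s_img, s_txt, temperature)
    )
    # Text-to-image direction: each text attends over all images in the batch.
    loss_t2i = cross_entropy(
        affinity(t_txt, t_img, temperature), affinity(s_txt, s_img, temperature)
    )
    return loss_i2t + loss_t2i

# Toy batch of two image-text pairs; the teacher uses 2-D embeddings,
# the student 3-D embeddings.
teacher_img = [[1.0, 0.0], [0.0, 1.0]]
teacher_txt = [[1.0, 0.0], [0.0, 1.0]]
student_img = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
aligned_txt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # same pairing as the teacher
swapped_txt = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]   # pairing reversed

loss_aligned = affinity_mimicking_loss(student_img, aligned_txt, teacher_img, teacher_txt)
loss_swapped = affinity_mimicking_loss(student_img, swapped_txt, teacher_img, teacher_txt)
```

In this toy batch, a student whose affinities match the teacher's receives a lower loss than one whose image-text pairing is reversed. Because only the affinity matrices are compared, the student's embedding width does not need to match the teacher's.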

Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi (Stephen) Chen, Xinggang Wang, Hongyang Chao, Han Hu • 2023

Related benchmarks

Task	Dataset	Result
Image Classification	ImageNet-1K	Top-1 Acc 40.8	1239
Image Classification	ImageNet V2	Top-1 Acc 55.7	767
Image Classification	ImageNet A	Top-1 Acc 22.8	723
Image Classification	Stanford Cars	--	705
Image Classification	ImageNet-R	Top-1 Acc 74.1	622
Text-to-Image Retrieval	Flickr30K	R@1 66	607
Image Classification	EuroSAT	--	569
Image Classification	Flowers102	Accuracy 70	558
Image Classification	RESISC45	--	539
Text-to-Image Retrieval	Flickr30k (test)	Recall@1 66	528

Showing 10 of 58 rows
