Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability

About

Large-scale pre-training has proven crucial for various computer vision tasks. However, as the volume of pre-training data and the number of model architectures grow, and as some pre-training data is private or inaccessible, pre-training every architecture on large-scale datasets is inefficient or outright impossible. In this work, we investigate an alternative strategy for pre-training, namely Knowledge Distillation as Efficient Pre-training (KDEP), which aims to efficiently transfer the learned feature representation from existing pre-trained models to new student models for future downstream tasks. We observe that existing Knowledge Distillation (KD) methods are unsuitable for pre-training because they normally distill the logits, which are discarded when the model is transferred to downstream tasks. To resolve this problem, we propose a feature-based KD method with non-parametric feature dimension aligning. Notably, our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets while requiring 10x less data and 5x less pre-training time. Code is available at https://github.com/CVMI-Lab/KDEP.
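
The idea of distilling features rather than logits, with a non-parametric step to reconcile mismatched feature dimensions, can be illustrated with a minimal sketch. The abstract does not say how the alignment is done, so the SVD-based projection, the mean-squared-error loss, the function names, and the random toy features below are all assumptions made for illustration, not the authors' implementation (see the linked repository for that).

```python
import numpy as np


def align_teacher_features(teacher_feats, student_dim):
    """Project teacher features (N, D_t) down to (N, student_dim).

    Assumption: alignment via SVD of the centered features, which needs
    no learned parameters. Requires student_dim <= min(N, D_t).
    """
    centered = teacher_feats - teacher_feats.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:student_dim].T


def feature_distillation_loss(student_feats, aligned_teacher_feats):
    """Mean squared error between student and aligned teacher features.

    Assumption: the abstract does not name the loss; MSE is a common choice.
    """
    return float(np.mean((student_feats - aligned_teacher_feats) ** 2))


# Hypothetical toy data: 64 samples, teacher dim 32, student dim 8.
rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(64, 32))
student_feats = rng.normal(size=(64, 8))

aligned = align_teacher_features(teacher_feats, student_dim=8)
loss = feature_distillation_loss(student_feats, aligned)
print(aligned.shape, loss)
```

In a real training loop the student features would come from the student backbone and the loss would be minimized by gradient descent. Because the projection is computed from the teacher features alone, it adds no trainable parameters that would later be thrown away, which is the property the abstract contrasts with logit distillation.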

Ruifei He, Shuyang Sun, Jihan Yang, Song Bai, Xiaojuan Qi • 2022

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Semantic segmentation	ADE20K	mIoU	37.3	936
Semantic segmentation	Cityscapes	mIoU	71.93	578
Image Classification	DTD	Accuracy	74.34	487
Image Classification	CUB	Accuracy	77.29	249
Semantic segmentation	PASCAL VOC 2012	mIoU	74.28	187
Object Detection	COCO	--	--	144
Image Classification	Caltech	Accuracy	79.08	98
Image Classification	CIFAR	Accuracy	82.89	38
Object Detection	VOC0712	AP	47.2	29
Metastasis Detection	Camelyon16 NSCLC source official (test)	AUC	83.5	10

Showing 10 of 13 rows
