Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability

About

Large-scale pre-training has proven crucial for many computer vision tasks. However, as the volume of pre-training data and the number of model architectures grow, and as some data is private or inaccessible, pre-training every architecture on large-scale datasets is inefficient and sometimes impossible. In this work, we investigate an alternative pre-training strategy, Knowledge Distillation as Efficient Pre-training (KDEP), which aims to efficiently transfer the learned feature representations of existing pre-trained models to new student models for future downstream tasks. We observe that existing Knowledge Distillation (KD) methods are unsuitable for pre-training because they normally distill the logits, which are discarded when the model is transferred to downstream tasks. To resolve this problem, we propose a feature-based KD method with non-parametric feature dimension aligning. Notably, our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets while requiring 10x less data and 5x less pre-training time. Code is available at https://github.com/CVMI-Lab/KDEP.

Ruifei He, Shuyang Sun, Jihan Yang, Song Bai, Xiaojuan Qi • 2022
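
The abstract describes distilling features rather than logits, with a non-parametric step that aligns the teacher's feature dimension to the student's. The sketch below illustrates one way such a pipeline could look; it is not taken from the KDEP repository. It assumes a PCA-style projection computed with SVD as the non-parametric aligner and a mean absolute error as the distillation loss. The function names (`fit_svd_projection`, `align_teacher`, `feature_distill_loss`), the feature sizes, and the random features are all hypothetical.

```python
import numpy as np


def fit_svd_projection(teacher_feats, student_dim):
    """Fit a projection from teacher_dim to student_dim with no learned weights.

    Assumption: PCA via SVD stands in for the "non-parametric feature
    dimension aligning" named in the abstract.
    """
    n_samples, teacher_dim = teacher_feats.shape
    if student_dim > min(n_samples, teacher_dim):
        raise ValueError("student_dim exceeds the number of available components")
    mean = teacher_feats.mean(axis=0, keepdims=True)
    centered = teacher_feats - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:student_dim].T  # shape: (teacher_dim, student_dim)
    return mean, basis


def align_teacher(teacher_feats, mean, basis):
    """Project teacher features into the student's feature dimension."""
    return (teacher_feats - mean) @ basis


def feature_distill_loss(student_feats, aligned_teacher_feats):
    """Mean absolute error between student and aligned teacher features."""
    return float(np.mean(np.abs(student_feats - aligned_teacher_feats)))


# Hypothetical features: 128 images, 64-d teacher, 16-d student.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(128, 64))
student = rng.normal(size=(128, 16))

mean, basis = fit_svd_projection(teacher, student_dim=16)
target = align_teacher(teacher, mean, basis)
loss = feature_distill_loss(student, target)
```

In an actual training loop the student features would come from the network being pre-trained, and the loss would be minimized with respect to the student's weights; the projection itself stays fixed because it has no trainable parameters.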

Related benchmarks

Task                   Dataset                                  Result          Rank
Semantic segmentation  ADE20K                                   mIoU 37.3       936
Semantic segmentation  Cityscapes                               mIoU 71.93      578
Image Classification   DTD                                      Accuracy 74.34  487
Image Classification   CUB                                      Accuracy 77.29  249
Semantic segmentation  PASCAL VOC 2012                          mIoU 74.28      187
Object Detection       COCO                                     --              144
Image Classification   Caltech                                  Accuracy 79.08  98
Image Classification   CIFAR                                    Accuracy 82.89  38
Object Detection       VOC0712                                  AP 47.2         29
Metastasis Detection   Camelyon16 NSCLC source official (test)  AUC 83.5        10
Showing 10 of 13 rows
