Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability
About
Large-scale pre-training has proven crucial for various computer vision tasks. However, as the volume of pre-training data and the number of model architectures grow, and as more data becomes private or inaccessible, pre-training every architecture on large-scale datasets is inefficient and sometimes impossible. In this work, we investigate an alternative strategy for pre-training, namely Knowledge Distillation as Efficient Pre-training (KDEP), which aims to efficiently transfer the learned feature representation from existing pre-trained models to new student models for future downstream tasks. We observe that existing Knowledge Distillation (KD) methods are unsuitable for pre-training, since they normally distill the logits, which are discarded when the model is transferred to downstream tasks. To resolve this problem, we propose a feature-based KD method with non-parametric feature dimension aligning. Notably, our method performs comparably with supervised pre-training counterparts across 3 downstream tasks and 9 downstream datasets while requiring 10x less data and 5x less pre-training time. Code is available at https://github.com/CVMI-Lab/KDEP.
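To make the idea of feature-based distillation with a non-parametric dimension aligner concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say which aligner or loss is used, so the SVD (PCA-style) projection, the mean-squared-error loss, and all function names below are assumptions chosen for illustration. The point it shows is that the teacher's wider features can be reduced to the student's width without any learned projection layer.

```python
import numpy as np


def fit_svd_projection(teacher_feats, out_dim):
    """Fit a parameter-free projection from teacher width to out_dim.

    Assumption: an SVD/PCA projection stands in for the "non-parametric
    feature dimension aligning" named in the abstract.
    """
    mean = teacher_feats.mean(axis=0, keepdims=True)
    centered = teacher_feats - mean
    # Rows of vt are orthonormal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:out_dim].T  # shape: (teacher_dim, out_dim)
    return mean, basis


def align_teacher(teacher_feats, mean, basis):
    """Project teacher features onto the top out_dim directions."""
    return (teacher_feats - mean) @ basis


def feature_distillation_loss(student_feats, aligned_teacher_feats):
    """Mean squared error between student and aligned teacher features.

    Assumption: MSE is used here only as a simple, common choice.
    """
    return float(np.mean((student_feats - aligned_teacher_feats) ** 2))


# Toy data: 64 samples, teacher width 32, student width 8 (hypothetical sizes).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(64, 32))
student = rng.normal(size=(64, 8))

mean, basis = fit_svd_projection(teacher, out_dim=8)
target = align_teacher(teacher, mean, basis)
loss = feature_distillation_loss(student, target)
```

In a real training loop the student features would come from the network being pre-trained and the loss would be minimized by gradient descent; the projection itself has no trainable parameters, which is what makes the aligner non-parametric in this sketch.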
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K | mIoU | 37.3 | 936 |
| Semantic Segmentation | Cityscapes | mIoU | 71.93 | 578 |
| Image Classification | DTD | Accuracy | 74.34 | 487 |
| Image Classification | CUB | Accuracy | 77.29 | 249 |
| Semantic Segmentation | PASCAL VOC 2012 | mIoU | 74.28 | 187 |
| Object Detection | COCO | -- | -- | 144 |
| Image Classification | Caltech | Accuracy | 79.08 | 98 |
| Image Classification | CIFAR | Accuracy | 82.89 | 38 |
| Object Detection | VOC0712 | AP | 47.2 | 29 |
| Metastasis Detection | Camelyon16 NSCLC source official (test) | AUC | 83.5 | 10 |