
On the Efficacy of Knowledge Distillation

About

In this paper, we present a thorough evaluation of the efficacy of knowledge distillation and its dependence on student and teacher architectures. Starting with the observation that more accurate teachers often don't make good teachers, we attempt to tease apart the factors that affect knowledge distillation performance. We find crucially that larger models do not often make better teachers. We show that this is a consequence of mismatched capacity, and that small students are unable to mimic large teachers. We find typical ways of circumventing this (such as performing a sequence of knowledge distillation steps) to be ineffective. Finally, we show that this effect can be mitigated by stopping the teacher's training early. Our results generalize across datasets and models.

Jang Hyun Cho, Bharath Hariharan · 2019
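For context, a minimal sketch of the standard Hinton-style distillation objective the paper evaluates: a temperature-softened KL term that makes the student mimic the teacher, blended with ordinary hard-label cross-entropy. This assumes PyTorch; the hyperparameter values T=4.0 and alpha=0.9 are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style KD loss: softened KL (mimic the teacher) + hard-label CE."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    # The T**2 factor keeps soft-target gradients comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

if __name__ == "__main__":
    # Dummy batch: 8 examples, 100 classes (e.g. CIFAR-100-sized output).
    s = torch.randn(8, 100)              # student logits
    t = torch.randn(8, 100)              # teacher logits (computed under no_grad)
    y = torch.randint(0, 100, (8,))      # ground-truth labels
    print(distillation_loss(s, t, y))
```

Under the paper's finding, the teacher logits fed into this loss would come from an early-stopped rather than fully-converged teacher when the student's capacity is much smaller than the teacher's.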

Related benchmarks

Task                  Dataset           Metric           Result (%)   Rank
Image Classification  ImageNet (val)    Top-1 Acc        71.25        1206
Image Classification  CIFAR-100 (val)   Accuracy         76.82        661
Image Classification  CIFAR-100         Top-1 Accuracy   74.95        622
Image Classification  CIFAR-10          Accuracy         92.9         507
Image Classification  CUB               Accuracy         76.87        249
Image Classification  Stanford Dogs     Accuracy         71.56        130
Image Classification  TinyImageNet      Accuracy         52.15        108
