Knowledge Distillation by On-the-Fly Native Ensemble
About
Knowledge distillation is effective for training small and generalisable network models that meet low-memory and fast-execution requirements. Existing offline distillation methods rely on a strong pre-trained teacher, which enables favourable knowledge discovery and transfer but requires a complex two-phase training procedure. Online counterparts address this limitation at the price of lacking a high-capacity teacher. In this work, we present an On-the-fly Native Ensemble (ONE) strategy for one-stage online distillation. Specifically, ONE trains only a single multi-branch network while simultaneously establishing a strong teacher on-the-fly to enhance the learning of the target network. Extensive evaluations show that ONE improves the generalisation performance of a variety of deep neural networks more significantly than alternative methods on four image classification datasets: CIFAR10, CIFAR100, SVHN, and ImageNet, whilst also offering computational efficiency advantages.
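The core idea can be expressed as a single training loss: a gating module combines the logits of all branches into an ensemble teacher on-the-fly, and each branch is trained on the hard labels while also distilling from that teacher. The following is a minimal PyTorch sketch of such a loss, not the authors' released implementation; the function name `one_distillation_loss`, the argument layout, the detached teacher, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def one_distillation_loss(branch_logits, gate_logits, targets, temperature=3.0):
    """Sketch of a ONE-style loss for a multi-branch network.

    branch_logits: list of m tensors, each (batch, num_classes), one per branch
    gate_logits:   tensor (batch, m), gating scores used to build the teacher
    targets:       tensor (batch,) of ground-truth class indices
    """
    # Stack branch logits into shape (batch, m, num_classes)
    logits = torch.stack(branch_logits, dim=1)

    # Gated ensemble teacher: weighted sum of branch logits, built on-the-fly
    gates = F.softmax(gate_logits, dim=1)                     # (batch, m)
    teacher_logits = (gates.unsqueeze(-1) * logits).sum(dim=1)

    # Hard-label cross-entropy for every branch and for the teacher ensemble
    ce = sum(F.cross_entropy(b, targets) for b in branch_logits)
    ce = ce + F.cross_entropy(teacher_logits, targets)

    # Distill the teacher's softened distribution back into each branch;
    # the KL term is scaled by T^2, as is standard in distillation.
    T = temperature
    teacher_soft = F.softmax(teacher_logits.detach() / T, dim=1)
    kl = sum(
        F.kl_div(F.log_softmax(b / T, dim=1), teacher_soft,
                 reduction="batchmean") * T * T
        for b in branch_logits
    )
    return ce + kl
```

In this sketch the gating weights are learned jointly with the branches through the teacher's cross-entropy term, while detaching the teacher inside the KL term (a common choice in distillation) stops the branches from dragging the teacher toward their own predictions.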
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | CIFAR-100 (test) | -- | 3518 |
| Image Classification | CIFAR-10 (test) | -- | 3381 |
| Image Classification | ImageNet-1k (val) | -- | 1453 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc): 93.1 | 504 |
| Image Classification | ImageNet (val) | Accuracy: 70.18 | 300 |
| Hyperspectral Image Classification | Pavia University (test) | Average Accuracy (AA): 78.88 | 96 |
| Hyperspectral Image Classification | Indian Pines (test) | Overall Accuracy (OA): 71.78 | 83 |
| Hyperspectral Image Classification | Pavia University (PU) HU-to-PU (test) | Overall Accuracy (OA): 0.7942 | 23 |
| Hyperspectral Image Classification | Indian Pines to Houston Knowledge Transfer (test) | Overall Accuracy (OA): 81.73 | 15 |