# Deep Mutual Learning

## About
Model distillation is an effective and widely used technique for transferring knowledge from a teacher to a student network. The typical application is to transfer from a powerful large network or ensemble to a small network that is better suited to low-memory or fast-execution requirements. In this paper, we present a deep mutual learning (DML) strategy in which, rather than one-way transfer from a static pre-defined teacher to a student, an ensemble of students learns collaboratively and the students teach each other throughout the training process. Our experiments show that a variety of network architectures benefit from mutual learning and achieve compelling results on the CIFAR-100 recognition and Market-1501 person re-identification benchmarks. Surprisingly, no prior powerful teacher network is necessary: mutual learning among a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher.
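The mutual-learning objective described above is commonly formulated as each student minimizing its own supervised cross-entropy loss plus a KL-divergence term that pulls its predictions toward each peer's. The sketch below illustrates this for a two-student cohort in plain Python; the function names and the restriction to two students are illustrative assumptions, not the paper's exact implementation (which trains deep networks on mini-batches).

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    # Supervised loss: negative log-likelihood of the true class.
    return -math.log(probs[label])

def kl_divergence(p, q):
    # KL(p || q): penalty for student q's predictions diverging
    # from peer p's predictions (the mimicry term in DML).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def dml_losses(logits1, logits2, label):
    """Per-student DML loss for a two-student cohort (illustrative):
    each student's loss = cross-entropy to the label
                        + KL from the peer's predictions to its own."""
    p1, p2 = softmax(logits1), softmax(logits2)
    loss1 = cross_entropy(p1, label) + kl_divergence(p2, p1)
    loss2 = cross_entropy(p2, label) + kl_divergence(p1, p2)
    return loss1, loss2
```

Note the asymmetry: each student treats the peer's current predictions as a soft target, so the "teacher" signal is updated at every training step rather than being fixed in advance, which is the key difference from conventional distillation.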
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | CIFAR-100 (test) | -- | 3518 |
| Image Classification | CIFAR-10 (test) | -- | 3381 |
| Person Re-Identification | Market1501 (test) | Rank-1 Accuracy: 89.34 | 1264 |
| Image Classification | ImageNet (val) | Top-1 Accuracy: 71.35 | 1206 |
| Person Re-Identification | Market 1501 | mAP: 68.8 | 999 |
| Image Classification | CIFAR-10 (test) | Accuracy: 87.71 | 906 |
| Image Classification | CIFAR-100 (val) | Accuracy: 73.58 | 661 |
| Natural Language Understanding | GLUE (dev) | SST-2 Accuracy: 93.3 | 504 |
| Natural Language Understanding | GLUE (test) | SST-2 Accuracy: 92.7 | 416 |
| Image Classification | ImageNet (val) | Accuracy: 69.82 | 300 |