Revisiting Knowledge Distillation for Autoregressive Language Models
About
Knowledge distillation (KD) is a common approach to compressing a teacher model by training a smaller student model, thereby reducing inference cost and memory footprint. However, in the context of autoregressive language models (LMs), we empirically find that a larger teacher LM can result in a dramatically poorer student. In response to this problem, we conduct a series of analyses and reveal that different tokens have different teaching modes; neglecting this leads to performance degradation. Motivated by this, we propose a simple yet effective adaptive teaching approach (ATKD) to improve KD. The core of ATKD is to reduce rote learning and make teaching more diverse and flexible. Extensive experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods achieve consistent and significant performance gains (up to +3.04% average score) across all model types and sizes. More encouragingly, ATKD effectively improves the generalization of the student model.
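To make the idea of token-dependent teaching concrete, the sketch below implements a per-token KD loss in which each position's teacher-student KL divergence is weighted by the teacher's normalized entropy at that position. This is a hypothetical illustration of the general principle (down-weighting "easy" tokens to reduce rote imitation), not the exact ATKD objective; the weighting scheme and function names are assumptions for demonstration.

```python
import math

def kl_div(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy of a distribution; a proxy for teacher uncertainty."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def adaptive_kd_loss(teacher_probs, student_probs):
    """Entropy-weighted token-level KD loss (illustrative, NOT the ATKD paper's
    exact formulation). Tokens where the teacher is uncertain (high entropy)
    receive more weight, discouraging rote imitation of easy tokens."""
    max_h = math.log(len(teacher_probs[0]))  # entropy upper bound for V classes
    total, weight_sum = 0.0, 0.0
    for p_t, p_s in zip(teacher_probs, student_probs):
        w = entropy(p_t) / max_h            # normalized weight in [0, 1]
        total += w * kl_div(p_t, p_s)
        weight_sum += w
    return total / max(weight_sum, 1e-12)

# Toy sequence of 3 tokens over a 4-word vocabulary.
teacher = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25], [0.9, 0.05, 0.03, 0.02]]
student = [[0.6, 0.2, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1], [0.8, 0.1, 0.05, 0.05]]
print(adaptive_kd_loss(teacher, student))
```

In practice the same weighting would be applied to softmax outputs of teacher and student logits at each decoding step; the uniform-distribution token here gets full weight, while the confident third token contributes less.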
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | LAMBADA (test) | -- | -- | 71 |
| Instruction Following | IFEval (test) | IFEval Score | 20.7 | 45 |
| Language Modeling | SNLG and SNLU evaluation suites (test) | SNLG Score | 62.66 | 44 |
| Natural Language Generation and Understanding | S_NLG and S_NLU (test) | Average Performance | 61.81 | 20 |
| General Knowledge Evaluation | General-purpose benchmarks average (test) | Accuracy | 63.8 | 12 |
| Generation | DollyEval (test) | LLM-as-a-Judge Score | 62.02 | 2 |
| Generation | VicunaEval (test) | LLM Judge Score | 56.07 | 2 |
| Generation | SelfInst (test) | LLM-as-a-Judge Score | 60.16 | 2 |
| Generation | WizardLM (test) | LLM-as-a-Judge Score | 48.37 | 2 |