DistiLLM: Towards Streamlined Distillation for Large Language Models
About
Knowledge distillation (KD) is widely used to compress a teacher model into a smaller student model, reducing inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) lack a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatch has significantly increased computational costs. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to use student-generated outputs more efficiently. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3$\times$ speedup compared to recent KD methods.
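To make the first component more concrete, here is a minimal sketch of a skew KL divergence between two discrete distributions. It assumes the common form KL(p || αp + (1-α)q), which the abstract itself does not spell out; the function names, the value `alpha=0.1`, and the toy distributions are illustrative assumptions, not the authors' implementation.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities.

    Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def skew_kl(p, q, alpha=0.1):
    """Skew KL (assumed form): KL(p || alpha * p + (1 - alpha) * q).

    Mixing a fraction alpha of p into q keeps the second argument positive
    wherever p is positive, so the divergence stays finite for 0 < alpha <= 1.
    """
    mix = [alpha * pi + (1.0 - alpha) * qi for pi, qi in zip(p, q)]
    return kl_divergence(p, mix)

# Toy next-token distributions (hypothetical values).
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.5, 0.0]  # assigns zero mass to a token the teacher allows

# Plain KL(teacher || student) is undefined here because of the zero entry,
# while the skewed version remains finite.
loss = skew_kl(teacher, student, alpha=0.1)
print(loss)
```

In this sketch the plain KL divergence would divide by zero for the third token, whereas the skewed target still yields a usable value, which is one intuition for why a skewed objective can be better behaved during training.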
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 57.2 | 1362 |
| Mathematical Reasoning | MATH | Accuracy | 21.2 | 882 |
| Reasoning | BBH | Accuracy | 36.5 | 672 |
| Instruction Following | IFEval | IFEval Accuracy | 62.2 | 625 |
| Logical reasoning | BBH | Accuracy | 36.5 | 201 |
| Arithmetic Reasoning | GSM8K | Accuracy | 0.00e+0 | 173 |
| Instruction Following | UnNI | Rouge-L | 38.2 | 160 |
| Code Generation | MBPP | Accuracy | 42.1 | 159 |
| Science Question Answering | SciQ | Normalized Accuracy | 85.7 | 137 |
| Instruction Following | S-NI | Rouge-L | 37.2 | 119 |