DistiLLM: Towards Streamlined Distillation for Large Language Models
About
Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller student model, reducing inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) lack a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatches has significantly escalated computational costs. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to improve the efficiency of utilizing student-generated outputs. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3$\times$ speedup compared to recent KD methods.
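To illustrate the first component, here is a minimal sketch of a skew KL divergence between two categorical distributions. It assumes the common skewed-KL form $\mathrm{KL}(p \,\|\, \alpha p + (1-\alpha) q)$, where the student distribution is mixed into the target before taking the log-ratio; the function name and the default `alpha` are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def skew_kl(p, q, alpha=0.1):
    """Skew KL divergence KL(p || alpha*p + (1-alpha)*q).

    Mixing a fraction alpha of p into q keeps the log-ratio finite even
    where q assigns (near-)zero probability, which is one practical
    motivation for skewing. At alpha=0 this reduces to standard KL(p||q).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mix = alpha * p + (1.0 - alpha) * q
    mask = p > 0  # 0 * log(0 / x) = 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / mix[mask])))

# Example: teacher vs. student next-token distributions over a tiny vocabulary.
teacher = np.array([0.5, 0.4, 0.1])
student = np.array([0.2, 0.3, 0.5])
loss = skew_kl(teacher, student, alpha=0.1)
```

Because the mixture places mass everywhere the target does, the skewed divergence stays bounded even for poorly calibrated student distributions, which is one of the stability properties the loss is designed around.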
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | DollyEval | -- | -- | 106 |
| Instruction Following | S-NI | ROUGE-L | 37.2 | 94 |
| Instruction Following | UnNI | ROUGE-L | 38.2 | 94 |
| Commonsense Reasoning | StrategyQA (test) | Accuracy | 62.8 | 81 |
| Instruction Following | VicunaEval | ROUGE-L | 20.4 | 72 |
| Instruction Following | SelfInst | ROUGE-L | 20.8 | 57 |
| Abstractive Dialogue Summarization | SamSum (test) | ROUGE-L | 52.1 | 53 |
| Instruction Following | SelfInst | ROUGE-L | 10.8 | 50 |
| Machine Translation | IWSLT en-de 2017 (test) | BLEU | 35.5 | 22 |
| Instruction Following | Vicuna | ROUGE-L | 17.1 | 6 |