DistiLLM: Towards Streamlined Distillation for Large Language Models
About
Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller student model, reducing inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) lack a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatches has significantly escalated computational costs. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to improve the efficiency of utilizing student-generated outputs. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3$\times$ speedup compared to recent KD methods.
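To illustrate the first component, here is a minimal sketch of a skew KL divergence between two categorical distributions. It assumes the common skewed-KL form $\mathrm{KL}(p \,\|\, \alpha p + (1-\alpha) q)$, where the student distribution is mixed into the target before taking the log-ratio; the function name and the default `alpha` are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def skew_kl(p, q, alpha=0.1):
    """Skew KL divergence KL(p || alpha*p + (1-alpha)*q).

    Mixing a fraction alpha of p into q keeps the log-ratio finite even
    where q assigns (near-)zero probability, which is one practical
    motivation for skewing. At alpha=0 this reduces to standard KL(p||q).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mix = alpha * p + (1.0 - alpha) * q
    mask = p > 0  # 0 * log(0 / x) = 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / mix[mask])))

# Example: teacher vs. student next-token distributions over a tiny vocabulary.
teacher = np.array([0.5, 0.4, 0.1])
student = np.array([0.2, 0.3, 0.5])
loss = skew_kl(teacher, student, alpha=0.1)
```

Because the mixture places mass everywhere the target does, the skewed divergence stays bounded even for poorly calibrated student distributions, which is one of the stability properties the loss is designed around.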
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | DollyEval | -- | -- | 106 |
| Instruction Following | S-NI | ROUGE-L | 37.2 | 94 |
| Instruction Following | UnNI | ROUGE-L | 38.2 | 94 |
| Commonsense Reasoning | StrategyQA (test) | Accuracy | 62.8 | 81 |
| Instruction Following | VicunaEval | ROUGE-L | 20.4 | 72 |
| Instruction Following | SelfInst | ROUGE-L | 20.8 | 57 |
| Abstractive Dialogue Summarization | SamSum (test) | ROUGE-L | 52.1 | 53 |
| Instruction Following | SelfInst | ROUGE-L | 10.8 | 50 |
| Machine Translation | IWSLT en-de 2017 (test) | BLEU | 35.5 | 22 |
| Instruction Following | Vicuna | ROUGE-L | 17.1 | 6 |