DistiLLM: Towards Streamlined Distillation for Large Language Models

About

Knowledge distillation (KD) is widely used to compress a teacher model into a smaller student model, reducing inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) lack a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatch has significantly increased computational cost. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties we unveil and leverage, and (2) an adaptive off-policy approach designed to use student-generated outputs more efficiently. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3× speedup compared to recent KD methods.

Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun • 2024

Related benchmarks

Task                                 Dataset                      Result          Rank
Instruction Following                DollyEval                    --              106
Instruction Following                S-NI                         Rouge-L 37.2    94
Instruction Following                UnNI                         Rouge-L 38.2    94
Commonsense Reasoning                StrategyQA (test)            Accuracy 62.8   81
Instruction Following                VicunaEval                   Rouge-L 20.4    72
Instruction Following                SelfInst                     Rouge-L 20.8    57
Abstractive dialogue summarization   SamSum (test)                ROUGE-L 52.1    53
Instruction Following                SelfInst                     R-L Score 10.8  50
Machine Translation                  IWSLT en-de 2017 (test)      BLEU 35.5       22
Instruction Following                Vicuna                       Rouge-L 17.1    6
Showing 10 of 11 rows
