DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
About
Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
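The abstract does not spell out the loss, so the following is only a minimal sketch of one way to pair a different loss with each data type. It assumes a skewed forward KL on teacher-generated responses (raising the student's likelihood where the teacher places mass) and a skewed reverse KL on student-generated responses (lowering the student's likelihood where the teacher places little mass). The function names, the `alpha` mixing coefficient, the equal 0.5 weighting, and the toy distributions are all illustrative assumptions, not the authors' implementation.

```python
import math

def skew_kl(p, q, alpha=0.1):
    """KL(p || alpha*p + (1-alpha)*q) for two discrete distributions.

    Assumes 0 < alpha < 1, so the mixture is positive wherever p is.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            mix = alpha * pi + (1.0 - alpha) * qi
            total += pi * math.log(pi / mix)
    return total

def contrastive_distill_loss(teacher_resp, student_resp, alpha=0.1):
    """Hypothetical contrastive objective over two kinds of responses.

    Each argument is a list of (teacher_dist, student_dist) pairs, one pair
    per token position, giving both models' next-token distributions.
    """
    # Teacher-generated tokens: forward direction, teacher is the reference.
    fwd = sum(skew_kl(p, q, alpha) for p, q in teacher_resp) / len(teacher_resp)
    # Student-generated tokens: reverse direction, student is the reference.
    rev = sum(skew_kl(q, p, alpha) for p, q in student_resp) / len(student_resp)
    return 0.5 * (fwd + rev)

# Toy usage over a 3-token vocabulary (made-up numbers).
teacher_resp = [([0.7, 0.2, 0.1], [0.4, 0.4, 0.2])]
student_resp = [([0.1, 0.1, 0.8], [0.5, 0.3, 0.2])]
loss = contrastive_distill_loss(teacher_resp, student_resp)
```

In a real training loop the distributions would come from the two models' logits and the loss would be differentiated with respect to the student's parameters only; this sketch only shows how the two terms are assembled.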
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH | Accuracy 15.07 | 882 |
| Code Generation | HumanEval (test) | Pass@1 42.7 | 612 |
| Instruction Following | AlpacaEval | Win Rate 73.66 | 420 |
| Code Generation | MBPP+ | Pass@1 63.2 | 238 |
| Mathematical Reasoning | AMC23 | PASS@1 Accuracy 30 | 207 |
| Code Generation | MBPP | Pass@1 45.63 | 193 |
| Code Generation | HumanEval | Pass@1 38.14 | 171 |
| Code Generation | HumanEval+ (test) | Pass@1 38.4 | 132 |
| Mathematical Reasoning | AIME 24 | Pass@1 Accuracy 13.3 | 128 |
| Instruction Following | DollyEval | -- | 114 |