
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

About

Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from distribution mismatch between output sequences seen during training and those generated by the student during inference. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of solely relying on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher on such sequences. Unlike supervised KD approaches, GKD also offers the flexibility to employ alternative loss functions between the student and teacher, which can be useful when the student lacks the expressivity to mimic the teacher's distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks, and task-agnostic distillation for instruction-tuning.

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem • 2023

Related benchmarks

Task                              Dataset           Result                Rank
Mathematical Reasoning            GSM8K             Accuracy 57.9         1398
Mathematical Reasoning            MATH500 (test)    --                    895
Mathematical Reasoning            MATH              Accuracy 22.8         882
Instruction Following             IFEval            IFEval Accuracy 63.5  836
Science Question Answering        ScienceQA         Accuracy 81.2         791
Reasoning                         BBH               Accuracy 36.2         726
Code Generation                   HumanEval (test)  Pass@1 40.09          612
Mathematical Reasoning            MATH 500          Accuracy (Acc) 62.8   543
Multi-turn Dialogue Evaluation    MT-Bench          --                    532
Multitask Language Understanding  MMLU              Accuracy 34.08        520
Showing 10 of 202 rows