f-Divergence Minimization for Sequence-Level Knowledge Distillation

About

Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose an f-DISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our f-DISTILL methods. We further derive step-wise decomposition for our f-DISTILL, reducing intractable sequence-level divergence to word-level losses that can be computed in a tractable manner. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution.
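
The step-wise decomposition mentioned above can be illustrated with a small sketch. The abstract does not name the four variants, so the code below assumes four common f-divergences (KL, reverse KL, Jensen-Shannon, and total variation distance) and uses hypothetical per-step teacher and student word distributions. It only evaluates word-level divergences for fixed distributions and sums them over decoding steps; it does not reproduce the paper's sampling scheme or training procedure.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); an illustrative choice, not taken from the paper


def kl(p, q):
    """KL(p || q) for each row of two (steps, vocab) arrays."""
    return np.sum(p * (np.log(p + EPS) - np.log(q + EPS)), axis=-1)


def step_divergences(teacher, student):
    """Per-step divergences between teacher and student word distributions."""
    mix = 0.5 * (teacher + student)
    return {
        "kl": kl(teacher, student),
        "reverse_kl": kl(student, teacher),
        "js": 0.5 * kl(teacher, mix) + 0.5 * kl(student, mix),
        "tvd": 0.5 * np.sum(np.abs(teacher - student), axis=-1),
    }


def sequence_loss(teacher, student, name):
    """Sum the chosen word-level divergence over all decoding steps."""
    return float(np.sum(step_divergences(teacher, student)[name]))


# Hypothetical distributions: 2 decoding steps, vocabulary of 3 words.
teacher = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
student = np.array([[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]])

for name in ("kl", "reverse_kl", "js", "tvd"):
    print(name, round(sequence_loss(teacher, student, name), 4))
```

In this sketch the JS and TVD values are unchanged when teacher and student are swapped, which is the sense in which those two losses are symmetric; KL and reverse KL are not.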

Yuqiao Wen, Zichao Li, Wenyu Du, Lili Mou • 2023

Related benchmarks

Task	Dataset	Metric	Result
Multitask Language Understanding	MMLU	Accuracy	27.41	520
Arithmetic Reasoning	GSM8K	Accuracy	0.00	272
Logical reasoning	BBH	Accuracy	23.55	249
General Reasoning	BBH	Accuracy	35.48	190
General Reasoning	MMLU	MMLU Accuracy	60.4	180
Instruction Following	UnNI	Rouge-L	25.24	178
Code Generation	HumanEval+ (test)	Pass@1	35.37	132
Instruction Following	S-NI	Rouge-L	24.58	119
Instruction Following	DollyEval	Rouge-L	23.88	114
Reasoning	Reasoning Benchmarks BBH, MMLU, ARC-C, ThmQA (test)	BBH	40.5	66

Showing 10 of 32 rows
