Sequence-Level Knowledge Distillation

About

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015), which have proven successful for reducing the size of neural models in other domains, to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2 BLEU with greedy decoding and 1.7 BLEU with beam search. Applying weight pruning on top of knowledge distillation results in a student model that has 13 times fewer parameters than the original teacher model, with a decrease of 0.4 BLEU.

Yoon Kim, Alexander M. Rush • 2016
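
The two distillation styles named in the abstract can be illustrated with a minimal sketch. It assumes that word-level distillation minimizes the cross-entropy between the teacher's and the student's per-position word distributions, and that sequence-level distillation trains the student on the teacher's decoded outputs. The function names and the `teacher_decode` callable are hypothetical and are not taken from the authors' code.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_level_kd_loss(teacher_logits, student_logits):
    """Cross-entropy between teacher and student distributions, summed over positions.

    Both arguments are lists of per-position logit vectors over the same vocabulary.
    """
    loss = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        q = softmax(t_row)  # teacher distribution (soft targets)
        p = softmax(s_row)  # student distribution
        loss -= sum(qk * math.log(pk) for qk, pk in zip(q, p))
    return loss


def sequence_level_kd_pairs(sources, teacher_decode):
    """Build a distilled training set: each source paired with the teacher's output.

    `teacher_decode` is a hypothetical callable standing in for decoding
    (for example, beam search) with the teacher model.
    """
    return [(src, teacher_decode(src)) for src in sources]


if __name__ == "__main__":
    teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
    student = [[1.0, 0.2, -0.5], [0.1, 0.9, 0.0]]
    print(word_level_kd_loss(teacher, student))
    # Toy stand-in for a teacher: upper-casing the source sentence.
    print(sequence_level_kd_pairs(["ein test"], lambda s: s.upper()))
```

In this sketch the word-level loss is smallest when the student reproduces the teacher's distribution at every position, while the sequence-level variant only changes the training data and leaves the student's usual training loss untouched.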

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	HellaSwag	Accuracy 51.33	1896
Commonsense Reasoning	WinoGrande	Accuracy 62.19	1442
Mathematical Reasoning	GSM8K	Accuracy 60.94	1398
Automatic Speech Recognition	LibriSpeech clean (test)	WER 4.23	1207
Automatic Speech Recognition	LibriSpeech (test-other)	WER 17.36	1206
Mathematical Reasoning	MATH	Accuracy 20.8	882
Instruction Following	IFEval	IFEval Accuracy 62.4	836
Commonsense Reasoning	PIQA	Accuracy 71.55	757
Reasoning	BBH	Accuracy 36.6	726
Code Generation	HumanEval (test)	Pass@1 41.5	612

Showing 10 of 158 rows
