
Autoregressive Knowledge Distillation through Imitation Learning

About

The performance of autoregressive models on natural language generation tasks has dramatically improved due to the adoption of deep, self-attentive architectures. However, these gains have come at the cost of slower inference, making state-of-the-art models cumbersome to deploy in real-world, time-sensitive settings. We develop a compression technique for autoregressive models that is driven by an imitation learning perspective on knowledge distillation. The algorithm is designed to address the exposure bias problem. On prototypical language generation tasks such as translation and summarization, our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation. Student models trained with our method attain BLEU/ROUGE scores 1.4 to 4.8 points higher than models trained from scratch, while running inference up to 14 times faster than the teacher model.

Alexander Lin, Jeremy Wohlwend, Howard Chen, Tao Lei • 2020
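
As a rough illustration of the idea in the abstract, the sketch below shows one form of imitation-learning distillation for a sequence-to-sequence model: prefixes are sampled from the student's own policy (sometimes mixed with expert rollouts, in the spirit of DAgger), and the teacher's per-token distribution supervises every sampled prefix. The model(src, ys) interface, the mixing probability beta, and all names here are illustrative assumptions, not the paper's exact algorithm.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def sample_prefixes(model, src, max_len, bos_id):
        # Roll the model out token by token, so training sees prefixes drawn
        # from the sampling policy itself rather than only gold prefixes.
        ys = torch.full((src.size(0), 1), bos_id,
                        dtype=torch.long, device=src.device)
        for _ in range(max_len - 1):
            logits = model(src, ys)  # assumed interface: (batch, t, vocab) logits
            next_tok = torch.multinomial(F.softmax(logits[:, -1], dim=-1), 1)
            ys = torch.cat([ys, next_tok], dim=1)
        return ys

    def distill_step(student, teacher, src, optimizer,
                     max_len=64, bos_id=1, beta=0.5):
        # DAgger-style mixture: with probability beta follow an expert rollout,
        # otherwise follow the student's own rollout.
        roller = teacher if torch.rand(1).item() < beta else student
        ys = sample_prefixes(roller, src, max_len, bos_id)

        with torch.no_grad():
            # Expert's next-token distribution at every step of the rollout.
            t_probs = F.softmax(teacher(src, ys[:, :-1]), dim=-1)

        s_logp = F.log_softmax(student(src, ys[:, :-1]), dim=-1)
        # Cross-entropy of the student against the teacher's full next-token
        # distribution at every prefix of the sampled sequence.
        loss = -(t_probs * s_logp).sum(dim=-1).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Sampling rollouts from the student is what targets exposure bias: the teacher corrects the student on states the student actually reaches with its own decoding at inference time, not only on reference prefixes.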

Related benchmarks

Task                               | Dataset            | Metric   | Result | Rank
-----------------------------------|--------------------|----------|--------|-----
Instruction Following              | UnNI               | ROUGE-L  | 28.7   | 160
Instruction Following              | S-NI               | ROUGE-L  | 33.1   | 119
Instruction Following              | DollyEval          | ROUGE-L  | 25.3   | 114
Instruction Following              | Vicuna             | ROUGE-L  | 16.0   | 83
Commonsense Reasoning              | StrategyQA (test)  | Accuracy | 61.7   | 81
Instruction Following              | SelfInst           | ROUGE-L  | 18.4   | 73
Instruction Following              | VicunaEval         | ROUGE-L  | 19.1   | 72
Abstractive Dialogue Summarization | SamSum (test)      | ROUGE-L  | 51.2   | 53
Mathematical Reasoning             | GSM-Plus (test)    | Accuracy | 21.3   | 50
Text Summarization                 | DialogueSUM (test) | ROUGE-L  | 35.1   | 49

Showing 10 of 23 rows.
