
Reducing Transformer Depth on Demand with Structured Dropout

About

Overparameterized transformer networks have obtained state-of-the-art results in various natural language processing tasks, such as machine translation, language modeling, and question answering. These models contain hundreds of millions of parameters, necessitating a large amount of computation and making them prone to overfitting. In this work, we explore LayerDrop, a form of structured dropout, which has a regularization effect during training and allows for efficient pruning at inference time. In particular, we show that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance. We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks. Moreover, we show that our approach leads to small BERT-like models of higher quality compared to training from scratch or using distillation.
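The core idea above can be sketched in a few lines: during training, each layer in the stack is skipped with probability p (structured dropout over whole layers); at inference, a shallower sub-network is selected by keeping a fixed subset of layers, with no finetuning. This is a minimal illustrative sketch, not the authors' implementation; the class name `LayerDropStack` and the `keep_every` pruning rule are hypothetical simplifications.

```python
import random

class LayerDropStack:
    """Illustrative sketch of LayerDrop (hypothetical class, not the paper's code).

    Training: each layer is dropped independently with probability p.
    Inference: a sub-network is selected by keeping every `keep_every`-th layer.
    """

    def __init__(self, layers, p=0.2):
        self.layers = layers      # list of callables, e.g. transformer blocks
        self.p = p                # per-layer drop probability
        self.training = True

    def forward(self, x, keep_every=None):
        for i, layer in enumerate(self.layers):
            if self.training:
                # Structured dropout: skip the entire layer with probability p.
                if random.random() < self.p:
                    continue
            elif keep_every is not None and i % keep_every != 0:
                # Inference-time pruning: keep layers 0, keep_every, 2*keep_every, ...
                continue
            x = layer(x)
        return x

# Toy usage: each "layer" just adds its index, so the kept layers are visible.
stack = LayerDropStack([(lambda x, k=k: x + k) for k in range(4)], p=0.5)
stack.training = False
full = stack.forward(0)                 # all layers: 0+1+2+3 = 6
pruned = stack.forward(0, keep_every=2) # layers 0 and 2 only: 0+2 = 2
```

Because every layer is trained to tolerate missing neighbors, such pruned sub-networks degrade gracefully rather than collapsing, which is what allows depth to be chosen on demand at inference.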

Angela Fan, Edouard Grave, Armand Joulin • 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Language Modeling | WikiText-103 (test) | Perplexity | 17.7 | 524 |
| Natural Language Understanding | GLUE | SST-2 | 94.7 | 452 |
| Machine Translation | WMT En-De 2014 (test) | BLEU | 30.2 | 379 |
| Image Classification | ImageNet (val) | Accuracy | 81.8 | 300 |
| Language Modeling | WikiText-103 (val) | Perplexity | 18.1 | 180 |
| Natural Language Understanding | GLUE (val) | SST-2 | 96.8 | 170 |
| Abstractive Text Summarization | CNN/Daily Mail (test) | ROUGE-L | 37.5 | 169 |
| Machine Translation | IWSLT En-De 2014 (test) | BLEU | 34.5 | 92 |
| Long-form Question Answering | ELI5 (test) | ROUGE-L | 23.4 | 54 |
| Natural Language Understanding | GLUE | CoLA Score | 43.7 | 15 |

Showing 10 of 13 rows.
