Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

About

Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to down-stream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that they have a compound effect even when sequentially applied on the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.

Iulia Turc, Ming-Wei Chang, Kenton Lee, Kristina Toutanova• 2019

Related benchmarks

Task	Dataset	Result
Natural Language Understanding	GLUE	SST-290.1	551
Natural Language Understanding	GLUE (dev)	SST-2 (Acc)91.1	529
Natural Language Understanding	GLUE (test)	SST-2 Accuracy91.8	416
Question Answering	SQuAD v1.1 (dev)	F1 Score81.7	380
Text Embedding	MTEB English v2	Mean Score54	113
Conversational Recommendation	INSPIRED (test)	R@14.4	52
Long-context retrieval	MLDR	--	36
Natural Language Code Search	CodeSearchNet	Overall Score46	35
Natural Language Understanding	GLUE v1 (dev)	MRPC Score89.4	30
Text Embedding	MTEB v2	Clustering Score39.6	17

Showing 10 of 14 rows

Other info

Follow for update

@wizwand_team Discord