
TinyBERT: Distilling BERT for Natural Language Understanding

About

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. We then introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT with 6 layers performs on par with its teacher BERT-Base.
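
The abstract compresses the method into a few clauses, so here is a minimal PyTorch sketch of what a layer-wise Transformer distillation objective of this kind can look like: MSE between mapped attention matrices, MSE between hidden states after projecting the narrower student into the teacher's width, and an optional soft cross-entropy on logits for prediction-layer distillation. The function name, layer mapping, loss weighting, and tensor shapes below are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def transformer_distill_loss(student_attn, teacher_attn,
                             student_hidden, teacher_hidden,
                             proj, student_logits=None,
                             teacher_logits=None, temperature=1.0):
    # student_attn / teacher_attn: lists of [batch, heads, seq, seq] attention
    # matrices from the student layers and the teacher layers they map to.
    # student_hidden / teacher_hidden: lists of [batch, seq, d] hidden states
    # (the embedding layer can be included as the first entry).
    # proj: learned linear map from the student width to the teacher width.
    loss = 0.0
    # Attention-based distillation: match attention matrices layer by layer.
    for s_a, t_a in zip(student_attn, teacher_attn):
        loss = loss + F.mse_loss(s_a, t_a)
    # Hidden-state (and embedding) distillation: project, then MSE.
    for s_h, t_h in zip(student_hidden, teacher_hidden):
        loss = loss + F.mse_loss(proj(s_h), t_h)
    # Prediction-layer distillation: soft cross-entropy on scaled logits.
    # In the two-stage framework this term applies only in the task-specific
    # stage; general-domain distillation omits it.
    if student_logits is not None and teacher_logits is not None:
        soft_t = F.softmax(teacher_logits / temperature, dim=-1)
        log_s = F.log_softmax(student_logits / temperature, dim=-1)
        loss = loss + (-(soft_t * log_s).sum(dim=-1)).mean()
    return loss

# Illustrative usage with a 4-layer, 312-wide student against a 12-layer,
# 768-wide teacher and a uniform layer mapping (student layer m <-> teacher
# layer 3m); all tensors here are random placeholders.
proj = torch.nn.Linear(312, 768)
s_attn = [torch.rand(2, 12, 16, 16) for _ in range(4)]
t_attn = [torch.rand(2, 12, 16, 16) for _ in range(4)]
s_hid = [torch.randn(2, 16, 312) for _ in range(5)]
t_hid = [torch.randn(2, 16, 768) for _ in range(5)]
loss = transformer_distill_loss(s_attn, t_attn, s_hid, t_hid, proj)

In the two-stage framework described above, a loss like this would first be minimized on a general corpus without the prediction-layer term (general distillation), and then on augmented task data with that term enabled (task-specific distillation).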

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu • 2019

Related benchmarks

Task                            Dataset                 Metric            Result   Rank
Semantic segmentation           ADE20K (val)            mIoU              40       2888
Natural Language Inference      SNLI (test)             Accuracy          78.25    690
Natural Language Understanding  GLUE                    SST-2             91.2     531
Natural Language Understanding  GLUE (dev)              SST-2 (Acc)       93       518
Natural Language Understanding  GLUE (test)             SST-2 Accuracy    93.1     416
Question Answering              SQuAD v1.1 (dev)        F1 Score          87.5     380
Image Retrieval                 Flickr30k (test)        R@1               40.8     210
Visual Entailment               SNLI-VE (test)          Overall Accuracy  73.31    197
Image Classification            ImageNet-1k 1.0 (test)  Top-1 Accuracy    0.792    191
Natural Language Understanding  GLUE (val)              SST-2             89.7     191

Showing 10 of 73 rows.

Other info

Code
