TinyBERT: Distilling BERT for Natural Language Understanding

About

Language model pre-training, such as BERT, has significantly improved performance on many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to run them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT with 6 layers performs on par with its teacher BERT-Base.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu • 2019
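
The abstract describes transferring knowledge from a large teacher to a small student but does not spell out the loss terms. As a rough illustration only, the sketch below shows a generic knowledge-distillation objective: a mean-squared error between a projected student hidden vector and the teacher hidden vector, plus a soft cross-entropy between teacher and student output distributions. All function names, the projection matrix, and the way the two terms are summed are assumptions made for this example; it is not the authors' Transformer distillation method.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature (assumed hyperparameter)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student distribution against the teacher's soft labels."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))

def mse(a, b):
    """Mean-squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def project(vector, weights):
    """Linear map from the student width to the teacher width.

    `weights` is a list of rows; each row has len(vector) entries.
    """
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def distillation_loss(student_hidden, teacher_hidden, projection,
                      student_logits, teacher_logits, temperature=1.0):
    """Hypothetical combined objective: hidden-state MSE plus soft cross-entropy."""
    hidden_loss = mse(project(student_hidden, projection), teacher_hidden)
    prediction_loss = soft_cross_entropy(student_logits, teacher_logits, temperature)
    return hidden_loss + prediction_loss

if __name__ == "__main__":
    # Toy values: a 2-dimensional student state mapped to a 3-dimensional teacher state.
    projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    loss = distillation_loss(
        student_hidden=[0.5, -0.2],
        teacher_hidden=[0.4, -0.1, 0.3],
        projection=projection,
        student_logits=[1.0, 0.2],
        teacher_logits=[2.0, -1.0],
        temperature=2.0,
    )
    print(f"toy distillation loss: {loss:.4f}")
```

The projection is needed because a smaller student typically has a narrower hidden size than its teacher, so the two vectors cannot be compared directly. In a real setup the projection would be learned jointly with the student, and the loss would be applied per layer and per token.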

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 40 | 2731 |
| Natural Language Inference | SNLI (test) | Accuracy 78.25 | 681 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) 93 | 504 |
| Natural Language Understanding | GLUE (test) | SST-2 Accuracy 93.1 | 416 |
| Question Answering | SQuAD v1.1 (dev) | F1 Score 87.5 | 375 |
| Visual Entailment | SNLI-VE (test) | Overall Accuracy 73.31 | 197 |
| Image Retrieval | Flickr30k (test) | R@1 40.8 | 195 |
| Image Classification | ImageNet-1k 1.0 (test) | Top-1 Accuracy 0.792 | 191 |
| Text Classification | SST-2 (test) | Accuracy 92.6 | 185 |
| Natural Language Understanding | GLUE (val) | SST-2 89.7 | 170 |
Showing 10 of 62 rows
