TinyBERT: Distilling BERT for Natural Language Understanding
About
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the wealth of knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT with 6 layers performs on par with its teacher BERT-Base.
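To make the teacher-to-student transfer concrete, here is a minimal sketch of the generic knowledge distillation objective on output logits: a soft cross-entropy between the temperature-softened teacher and student distributions. This is an illustration of plain KD, not the Transformer-specific distillation method the abstract proposes; the function names, the `temperature` parameter, and the example logits are all assumptions made for this sketch.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax using the log-sum-exp trick."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_total for z in scaled]


def soft_cross_entropy(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy of the student distribution against the teacher's soft labels.

    Hypothetical helper for illustration: lower values mean the student's
    predicted distribution is closer to the teacher's.
    """
    p_teacher = softmax(teacher_logits, temperature)
    log_q_student = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(p_teacher, log_q_student))


# Made-up logits for a 3-class task (assumption, not data from the paper).
teacher = [2.0, 0.5, -1.0]
good_student = [1.8, 0.6, -0.9]   # roughly agrees with the teacher
bad_student = [-1.0, 0.5, 2.0]    # ranks the classes in reverse

print(soft_cross_entropy(teacher, good_student, temperature=2.0))
print(soft_cross_entropy(teacher, bad_student, temperature=2.0))
```

For a fixed teacher, this loss is smallest when the student's softened distribution matches the teacher's, so minimizing it pulls the student toward the teacher's predictions. A higher temperature flattens both distributions and gives more weight to the teacher's relative preferences among non-top classes.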
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 40 | 2731 |
| Natural Language Inference | SNLI (test) | Accuracy | 78.25 | 681 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 93 | 504 |
| Natural Language Understanding | GLUE (test) | SST-2 Accuracy | 93.1 | 416 |
| Question Answering | SQuAD v1.1 (dev) | F1 Score | 87.5 | 375 |
| Visual Entailment | SNLI-VE (test) | Overall Accuracy | 73.31 | 197 |
| Image Retrieval | Flickr30k (test) | R@1 | 40.8 | 195 |
| Image Classification | ImageNet-1k 1.0 (test) | Top-1 Accuracy | 0.792 | 191 |
| Text Classification | SST-2 (test) | Accuracy | 92.6 | 185 |
| Natural Language Understanding | GLUE (val) | SST-2 | 89.7 | 170 |