Q8BERT: Quantized 8Bit BERT
About
Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters, and the emergence of even larger and more accurate models such as GPT-2 and Megatron suggests a trend toward ever-larger pre-trained Transformer models. Using these large models in production environments is a complex task requiring a large amount of compute, memory, and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by $4\times$ with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference if it is optimized for hardware that supports 8-bit integer operations.
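The core operation behind quantization-aware training is "fake quantization": during fine-tuning, weights and activations are rounded to 8-bit integer levels in the forward pass while gradients flow through as if no rounding occurred. The sketch below illustrates symmetric linear 8-bit quantize–dequantize; the function name and details are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Simulate symmetric linear quantization to `num_bits` integers.

    Values are scaled into the integer range, rounded, clipped, and
    scaled back, so the output is a float tensor restricted to the
    representable quantization levels (illustrative sketch only).
    """
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8-bit
    scale = np.max(np.abs(x)) / qmax        # one scale for the whole tensor
    if scale == 0:                          # guard against all-zero input
        return np.zeros_like(x)
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale                        # dequantize back to float

# Quantization error is bounded by half a quantization step.
w = np.array([0.5, -1.0, 0.25, 0.0])
w_q = fake_quantize(w)
step = np.max(np.abs(w)) / 127
```

At inference time the rounded integers themselves would be used (with the scale carried separately), which is what enables the speedup on 8-bit integer hardware.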
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Natural Language Understanding | GLUE | SST-2: 94.7 | 531 |
| Question Answering | SQuAD v1.1 (dev) | F1 Score: 87.74 | 380 |
| Natural Language Understanding | GLUE 1.0 (dev) | SST-2 (Acc): 92.24 | 15 |
| Text Embedding | MTEB | MTEB Quality Score: 65.8 | 8 |