BiT: Robustly Binarized Multi-distilled Transformer
About
Modern pre-trained transformers have rapidly advanced the state of the art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarizing the weights and activations of the network can significantly alleviate these issues, but it is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers at a much higher accuracy than was previously possible. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher-precision models into lower-precision students. These approaches allow, for the first time, fully binarized transformer models that reach a practical level of accuracy, approaching a full-precision BERT baseline on the GLUE language understanding benchmark within as little as 5.9%. Code and models are available at: https://github.com/facebookresearch/bit.
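To make the "two-set binarization scheme" and the "elastic binary activation function with learned parameters" more concrete, here is a minimal forward-pass sketch in plain Python. The abstract does not give the formula, so everything here is an assumption for illustration: the function names, the scale `alpha` and threshold `beta` (standing in for the learned parameters), and the choice of output sets {-alpha, +alpha} for signed inputs and {0, alpha} for non-negative inputs. Training details such as gradient estimation through the rounding step are left out.

```python
from typing import List


def binarize_signed(x: List[float], alpha: float, beta: float) -> List[float]:
    """Map each value to {-alpha, +alpha} by the sign of (x - beta).

    alpha (scale, assumed > 0) and beta (threshold) are hypothetical
    stand-ins for the learned parameters mentioned in the abstract.
    """
    return [alpha if (v - beta) >= 0 else -alpha for v in x]


def binarize_unsigned(x: List[float], alpha: float, beta: float) -> List[float]:
    """Map each value to {0, alpha} via alpha * clip(round((x - beta) / alpha), 0, 1).

    Intended for inputs that are non-negative by construction (for example,
    outputs of a softmax or ReLU), where a {-1, +1} code would waste a level.
    """
    out = []
    for v in x:
        level = round((v - beta) / alpha)   # nearest integer level
        level = min(max(level, 0), 1)       # clip to the two allowed levels
        out.append(alpha * level)
    return out


# Example inputs (illustrative values only, not from the paper).
activations = [-1.0, 0.2, 0.5, 2.0]
signed = binarize_signed(activations, alpha=0.5, beta=0.1)
unsigned = binarize_unsigned(activations, alpha=0.5, beta=0.1)
print(signed)    # [-0.5, 0.5, 0.5, 0.5]
print(unsigned)  # [0.0, 0.0, 0.5, 0.5]
```

Choosing the output set per tensor is the point of the sketch: a tensor that can never be negative is better served by {0, alpha} than by a symmetric code, while `alpha` and `beta` let the quantizer adapt its scale and threshold to each layer's distribution.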
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 20.57 | 2839 |
| Language Modeling | C4 | Perplexity | 22.31 | 1422 |
| Natural Language Understanding | GLUE (test dev) | MRPC Accuracy | 79.7 | 87 |
| Commonsense Reasoning | CommonsenseQA | Accuracy (pass@1) | 42.5 | 45 |
| Natural Language Understanding | GLUE | SST-2 | 92.3 | 20 |
| Commonsense Question Answering | Commonsense QA | BoolQ Accuracy | 59.9 | 17 |