BiT: Robustly Binarized Multi-distilled Transformer

About

Modern pre-trained transformers have rapidly advanced the state-of-the-art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarization of the weights and activations of the network can significantly alleviate these issues, but it is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers at a much higher accuracy than was previously possible. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher precision models into lower precision students. These approaches allow, for the first time, fully binarized transformer models that reach a practical level of accuracy, coming within as little as 5.9% of a full-precision BERT baseline on the GLUE language understanding benchmark. Code and models are available at: https://github.com/facebookresearch/bit.

Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li, Raghuraman Krishnamoorthi, Yashar Mehdad • 2022
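
The abstract names a two-set binarization scheme and an elastic binary activation with learned parameters but does not spell out the formulas. The following is only a minimal sketch of what binarizing to two different value sets can look like, assuming one set {-alpha, +alpha} for signed values and one set {0, alpha} for non-negative values. The function names, the scale `alpha`, and the shift `beta` are hypothetical illustrations (in a trained model they would be learned), not the authors' implementation; see the linked repository for the actual code.

```python
def mean_abs(xs):
    """Average magnitude; a common (assumed) choice of scale for signed binarization."""
    return sum(abs(x) for x in xs) / len(xs)


def binarize_signed(xs, alpha):
    """Map each value to {-alpha, +alpha} by its sign (zero maps to +alpha)."""
    return [alpha if x >= 0 else -alpha for x in xs]


def binarize_unsigned(xs, alpha, beta=0.0):
    """Map each value to {0, alpha}.

    alpha (scale) and beta (shift) stand in for learned parameters;
    here they are plain numbers passed in by the caller.
    """
    out = []
    for x in xs:
        # Shift and scale, then pick the nearer of the two levels 0 and 1.
        level = 1 if (x - beta) / alpha >= 0.5 else 0
        out.append(alpha * level)
    return out


# Signed values (e.g. weights): scale by the mean absolute value.
weights = [0.5, -1.5, 0.25, -0.75]
w_scale = mean_abs(weights)
binary_weights = binarize_signed(weights, w_scale)

# Non-negative values (e.g. attention probabilities): use the {0, alpha} set.
probs = [0.0, 0.25, 0.5, 1.0]
binary_probs = binarize_unsigned(probs, alpha=0.5)

print(binary_weights)
print(binary_probs)
```

Using a separate {0, alpha} set for non-negative inputs avoids spending one of the two available levels on negative values that never occur, which is the intuition this sketch is meant to convey.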

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText2	Perplexity 20.57	3785
Language Modeling	C4	Perplexity 22.31	1565
Natural Language Understanding	GLUE	SST-2 91.5	551
Natural Language Understanding	GLUE (dev)	--	529
Commonsense Reasoning	CommonsenseQA	Accuracy (pass@1) 42.5	108
Natural Language Understanding	GLUE (test dev)	MRPC Accuracy 79.7	90
Natural Language Understanding	GLUE	SST-2 92.3	40
Commonsense Question Answering	Commonsense QA	BoolQ Accuracy 59.9	29
