BitNet: Scaling 1-bit Transformers for Large Language Models

About

The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei• 2023

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText2	Perplexity11.2	3785
Language Modeling	C4	Perplexity11.17	1565
Commonsense Reasoning	CommonsenseQA	Accuracy (pass@1)47.2	108
Zero-shot Reasoning	Reasoning Suite (ARC-e, ARC-c, HellaSwag, PIQA, Winogrande) zero-shot	Average Reasoning Score0.4621	107
Commonsense Question Answering	Commonsense QA	BoolQ Accuracy62	29
Large Language Model Inference	Decode Phase BS=1	Latency (s)0.158	18
Large Language Model Inference	Prefill Phase SeqLen=2k	Prefill Time (s)0.026	15
Zero-shot Learning	Reasoning Suite Zero-shot (ARC-e, ARC-c, WG, BQ, PIQA, HS, OBQA, HQA)	ARC-e Accuracy48.7	9

Showing 8 of 8 rows

Other info

Follow for update

@wizwand_team Discord