BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
About
Ultra-low-bit quantization brings substantial efficiency gains to Transformer-based models, but accuracy degradation and limited GPU support hinder its wide adoption. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, which combines a Levelwise Degradation Strategy with a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance for BERT, with an average 3.5% drop on GLUE and less than a 2% drop on five tasks, and achieves comparable perplexity and accuracy for LLMs. In terms of efficiency, it delivers a 16 to 24 times kernel-level speedup over FP16 on NVIDIA GPUs and 216 to 330 tokens/s end-to-end prefill throughput with a lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency ultra-low-bit inference without sacrificing model quality.
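
The core computation described above, binary weights multiplied against ternary activations whose tiny values are projected to zero, can be evaluated as a bit-packed XOR/popcount dot product. The CUDA snippet below is a minimal sketch of that idea; the threshold, packing layout, function names, and kernel structure are illustrative assumptions and do not reproduce the paper's optimized BWTA MatMul kernel.

```cuda
// Minimal sketch of the BWTA idea: binary weights (sign bits) against ternary
// activations ({-1, 0, +1}, tiny values projected to zero), evaluated with a
// bit-packed XOR/popcount dot product. Threshold, packing layout, and kernel
// structure are illustrative assumptions, not the paper's optimized kernel.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Ternarize one activation: |x| below the threshold maps to 0; otherwise the
// sign bit encodes +1 (bit = 1) or -1 (bit = 0), and the mask bit marks nonzero.
__device__ __forceinline__ void ternarize(float x, float threshold,
                                          uint32_t &sign_bit, uint32_t &mask_bit) {
    mask_bit = (fabsf(x) >= threshold) ? 1u : 0u;
    sign_bit = (x >= 0.0f) ? 1u : 0u;
}

// Dot product of 32 binary weights and 32 ternary activations:
//   dot = popcount(mask) - 2 * popcount((w_sign XOR a_sign) & mask)
__device__ __forceinline__ int bwta_dot32(uint32_t w_sign, uint32_t a_sign,
                                          uint32_t a_mask) {
    return __popc(a_mask) - 2 * __popc((w_sign ^ a_sign) & a_mask);
}

// Toy kernel: each thread bit-packs 32 activations and accumulates one
// 32-element BWTA dot product against a pre-packed weight word.
__global__ void bwta_dot_kernel(const float *act, const uint32_t *w_packed,
                                float threshold, int *out, int n_words) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n_words) return;

    uint32_t a_sign = 0, a_mask = 0;
    for (int i = 0; i < 32; ++i) {          // sequential bit-packing for clarity
        uint32_t s, m;
        ternarize(act[t * 32 + i], threshold, s, m);
        a_sign |= s << i;
        a_mask |= m << i;
    }
    out[t] = bwta_dot32(w_packed[t], a_sign, a_mask);
}

int main() {
    // 32 activations: 16 clearly positive, 16 near zero (projected to 0).
    float h_act[32];
    for (int i = 0; i < 32; ++i) h_act[i] = (i % 2) ? 0.8f : -0.01f;
    uint32_t h_w = 0xFFFFFFFFu;             // all 32 binary weights are +1

    float *d_act; uint32_t *d_w; int *d_out;
    cudaMalloc(&d_act, sizeof(h_act));
    cudaMalloc(&d_w, sizeof(h_w));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_act, h_act, sizeof(h_act), cudaMemcpyHostToDevice);
    cudaMemcpy(d_w, &h_w, sizeof(h_w), cudaMemcpyHostToDevice);

    bwta_dot_kernel<<<1, 1>>>(d_act, d_w, 0.1f, d_out, 1);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("BWTA dot product: %d\n", h_out); // 16 nonzero activations, all agreeing: 16
    cudaFree(d_act); cudaFree(d_w); cudaFree(d_out);
    return 0;
}
```

The identity in `bwta_dot32` follows from the fact that, over the nonzero activation positions, sign agreements and disagreements sum to `popcount(mask)`, so the signed sum equals `popcount(mask) - 2 * disagreements`.
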
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 15.58 | 2839 |
| Language Modeling | C4 | Perplexity | 16.09 | 1422 |
| Matrix Multiplication Latency | Synthetic Matrix Multiplication Shapes | Latency (µs) | 9.24 | 198 |
| Commonsense Reasoning | CommonsenseQA | Accuracy (pass@1) | 44.8 | 45 |
| Natural Language Understanding | GLUE | SST-2 | 92.9 | 20 |
| Large Language Model Inference | Decode Phase BS=1 | Latency (s) | 0.152 | 18 |
| Commonsense Question Answering | Commonsense QA | BoolQ Accuracy | 63.7 | 17 |
| Large Language Model Inference | Prefill Phase SeqLen=2k | Prefill Time (s) | 0.025 | 15 |
| Matrix Multiplication | Synthetic Transformer Shapes, Attention-Value Att ⊗ V | Latency (µs) | 6.38 | 9 |
| Matrix Multiplication | Synthetic Transformer Shapes, Query-Key Q ⊗ K⊤ | Latency (µs) | 5.41 | 9 |