BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

About

Ultra-low-bit quantization brings substantial efficiency gains to Transformer-based models, but accuracy degradation and limited GPU support hinder its wide adoption. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance on BERT, with an average 3.5% drop on GLUE and less than a 2% drop on five tasks, and achieves comparable perplexity and accuracy on LLMs. In terms of efficiency, it delivers a 16-24× kernel-level speedup over FP16 on NVIDIA GPUs and an end-to-end prefill throughput of 216-330 tokens/s with a lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency, ultra-low-bit inference without sacrificing model quality.
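To make the scheme concrete, below is a minimal NumPy sketch of what binary-weight, ternary-activation quantization and the resulting bitwise dot product could look like. The XNOR-Net-style mean-absolute scales, the `delta_ratio` threshold, and all function names here are illustrative assumptions, not the paper's method; in particular, the Magnitude-Alignment Projection Factor and Smooth Multi-Stage Quantization training are not reproduced.

```python
import numpy as np

def binarize_weights(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Binarize weights to {-1, +1} with a per-tensor scale.

    Mean-absolute-value scaling follows common binary-network practice
    (e.g. XNOR-Net); the paper's projection factor may differ.
    """
    alpha = float(np.abs(w).mean())
    return np.where(w >= 0, 1, -1).astype(np.int8), alpha

def ternarize_activations(x: np.ndarray, delta_ratio: float = 0.05) -> tuple[np.ndarray, float]:
    """Ternarize activations to {-1, 0, +1}, projecting tiny values to zero.

    delta_ratio is a hypothetical threshold hyperparameter; the abstract
    only states that near-zero values are mapped to zero to avoid
    zero-point distortion.
    """
    delta = delta_ratio * np.abs(x).max()      # zero-projection threshold
    keep = np.abs(x) > delta                   # lanes that survive projection
    beta = float(np.abs(x[keep]).mean()) if keep.any() else 1.0
    q = np.where(keep, np.sign(x), 0).astype(np.int8)
    return q, beta

def packed_dot(wb: np.ndarray, tq: np.ndarray) -> int:
    """Binary × ternary dot product via bit operations, mimicking what a
    popcount-based CUDA kernel would do on packed 32/64-element words.

    wb is in {-1, +1}; tq is in {-1, 0, +1}. Signs are treated as
    bitplanes, and agreements are counted over the nonzero ternary lanes.
    """
    w_sign = wb > 0                  # bitplane: 1 encodes +1
    t_sign = tq > 0                  # bitplane: 1 encodes +1
    t_mask = tq != 0                 # bitplane: 1 encodes an active lane
    agree = np.count_nonzero(~(w_sign ^ t_sign) & t_mask)  # XNOR + popcount
    active = np.count_nonzero(t_mask)
    return 2 * agree - active        # +1 per agreement, -1 per disagreement

# The full-precision product w·x is then approximated by
# alpha * beta * packed_dot(wb, tq).
```

Because both operands occupy at most two bits per element, the inner product reduces to XNOR, AND, and popcount instructions over packed words; this is the kind of bitwise arithmetic the paper's BWTA MatMul CUDA kernel exploits for its reported kernel-level speedups.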

Yifu Ding, Xianglong Liu, Shenghao Jin, Jinyang Guo, Jiwen Lu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Language Modeling | WikiText2 | Perplexity | 15.58 | 2839 |
| Language Modeling | C4 | Perplexity | 16.09 | 1422 |
| Matrix Multiplication Latency | Synthetic Matrix Multiplication Shapes | Latency (µs) | 9.24 | 198 |
| Commonsense Reasoning | CommonsenseQA | Accuracy (pass@1) | 44.8 | 45 |
| Natural Language Understanding | GLUE | SST-2 | 92.9 | 20 |
| Large Language Model Inference | Decode Phase (BS=1) | Latency (s) | 0.152 | 18 |
| Commonsense Question Answering | Commonsense QA | BoolQ Accuracy | 63.7 | 17 |
| Large Language Model Inference | Prefill Phase (SeqLen=2k) | Prefill Time (s) | 0.025 | 15 |
| Matrix Multiplication | Synthetic Transformer Shapes, Attention-Value (Att ⊗ V) | Latency (µs) | 6.38 | 9 |
| Matrix Multiplication | Synthetic Transformer Shapes, Query-Key (Q ⊗ K⊤) | Latency (µs) | 5.41 | 9 |
(10 of 11 benchmark rows shown.)
