BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
About
Ultra-low-bit quantization brings substantial efficiency gains to Transformer-based models, but accuracy degradation and limited GPU support hinder its wide adoption. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, which combines a Levelwise Degradation Strategy with a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision performance for BERT, with an average 3.5% drop on GLUE and less than a 2% drop on five tasks, and achieves comparable perplexity and accuracy for LLMs. In terms of efficiency, it delivers a 16 to 24 times kernel-level speedup over FP16 on NVIDIA GPUs and 216 to 330 tokens/s end-to-end prefill throughput with a lower memory footprint on LLMs. As an algorithm-hardware co-design, BWTA demonstrates practical, low-latency ultra-low-bit inference without sacrificing model quality.
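
The core computation described above, binary weights multiplied against ternary activations whose tiny values are projected to zero, can be evaluated as a bit-packed XOR/popcount dot product. The CUDA snippet below is a minimal sketch of that idea; the threshold, packing layout, function names, and kernel structure are illustrative assumptions and do not reproduce the paper's optimized BWTA MatMul kernel.

```cuda
// Minimal sketch of the BWTA idea: binary weights (sign bits) against ternary
// activations ({-1, 0, +1}, tiny values projected to zero), evaluated with a
// bit-packed XOR/popcount dot product. Threshold, packing layout, and kernel
// structure are illustrative assumptions, not the paper's optimized kernel.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Ternarize one activation: |x| below the threshold maps to 0; otherwise the
// sign bit encodes +1 (bit = 1) or -1 (bit = 0), and the mask bit marks nonzero.
__device__ __forceinline__ void ternarize(float x, float threshold,
                                          uint32_t &sign_bit, uint32_t &mask_bit) {
    mask_bit = (fabsf(x) >= threshold) ? 1u : 0u;
    sign_bit = (x >= 0.0f) ? 1u : 0u;
}

// Dot product of 32 binary weights and 32 ternary activations:
//   dot = popcount(mask) - 2 * popcount((w_sign XOR a_sign) & mask)
__device__ __forceinline__ int bwta_dot32(uint32_t w_sign, uint32_t a_sign,
                                          uint32_t a_mask) {
    return __popc(a_mask) - 2 * __popc((w_sign ^ a_sign) & a_mask);
}

// Toy kernel: each thread bit-packs 32 activations and accumulates one
// 32-element BWTA dot product against a pre-packed weight word.
__global__ void bwta_dot_kernel(const float *act, const uint32_t *w_packed,
                                float threshold, int *out, int n_words) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n_words) return;

    uint32_t a_sign = 0, a_mask = 0;
    for (int i = 0; i < 32; ++i) {          // sequential bit-packing for clarity
        uint32_t s, m;
        ternarize(act[t * 32 + i], threshold, s, m);
        a_sign |= s << i;
        a_mask |= m << i;
    }
    out[t] = bwta_dot32(w_packed[t], a_sign, a_mask);
}

int main() {
    // 32 activations: 16 clearly positive, 16 near zero (projected to 0).
    float h_act[32];
    for (int i = 0; i < 32; ++i) h_act[i] = (i % 2) ? 0.8f : -0.01f;
    uint32_t h_w = 0xFFFFFFFFu;             // all 32 binary weights are +1

    float *d_act; uint32_t *d_w; int *d_out;
    cudaMalloc(&d_act, sizeof(h_act));
    cudaMalloc(&d_w, sizeof(h_w));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_act, h_act, sizeof(h_act), cudaMemcpyHostToDevice);
    cudaMemcpy(d_w, &h_w, sizeof(h_w), cudaMemcpyHostToDevice);

    bwta_dot_kernel<<<1, 1>>>(d_act, d_w, 0.1f, d_out, 1);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("BWTA dot product: %d\n", h_out); // 16 nonzero activations, all agreeing: 16
    cudaFree(d_act); cudaFree(d_w); cudaFree(d_out);
    return 0;
}
```

The identity in `bwta_dot32` follows from the fact that, over the nonzero activation positions, sign agreements and disagreements sum to `popcount(mask)`, so the signed sum equals `popcount(mask) - 2 * disagreements`.
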
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 15.58 | 2839 |
| Language Modeling | C4 | Perplexity | 16.09 | 1422 |
| Matrix Multiplication Latency | Synthetic Matrix Multiplication Shapes | Latency (µs) | 9.24 | 198 |
| Commonsense Reasoning | CommonsenseQA | Accuracy (pass@1) | 44.8 | 45 |
| Natural Language Understanding | GLUE | SST-2 | 92.9 | 20 |
| Large Language Model Inference | Decode Phase BS=1 | Latency (s) | 0.152 | 18 |
| Commonsense Question Answering | Commonsense QA | BoolQ Accuracy | 63.7 | 17 |
| Large Language Model Inference | Prefill Phase SeqLen=2k | Prefill Time (s) | 0.025 | 15 |
| Matrix Multiplication | Synthetic Transformer Shapes, Attention-Value Att ⊗ V | Latency (µs) | 6.38 | 9 |
| Matrix Multiplication | Synthetic Transformer Shapes, Query-Key Q ⊗ K⊤ | Latency (µs) | 5.41 | 9 |