Achieving binary weight and activation for LLMs using Post-Training Quantization

About

Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grain grouping and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose utilizing Hessian-aware fine-grained grouping along with an EM-based quantization scheme. For activation quantization, we decompose INT4-quantized activations into a 4 * INT1 format equivalently and simultaneously smooth the scaling factors based on quantization errors, which further reduces the quantization errors in activations. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models. Code is available at https://github.com/JimmyCrave/LLM-PTQ-binarization.

Siqing Song, Chuang Wang, Ruiqi Wang, Yi Yang, Xu-Yao Zhang• 2025

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText2	Perplexity7.17	3785
Language Modeling	C4	Perplexity10.18	1565
Language Modeling	PTB	Perplexity37.2	1234
Multiple-choice Question Answering	HellaSwag	Accuracy55.76	196
Language Understanding	MMLU (test)	MMLU Average Accuracy28	167
Question Answering	QA Suite Zero-shot (PIQA, ARC-E, ARC-C, BoolQ, HellaSwag, WinoGrande)	PIQA Accuracy72.09	141
Language Modeling	Penn Treebank (PTB) (test)	Perplexity69.46	130
Commonsense Question Answering	WinoGrande	Accuracy58.01	73
Commonsense Question Answering	ARC-E	Accuracy46.13	29
Commonsense Question Answering	ARC Challenge	Accuracy30.55	21

Showing 10 of 14 rows

Other info

Follow for update

@wizwand_team Discord