Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

About

Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. To reduce their computational and memory burden, various compression techniques have been proposed, including quantization, pruning, and knowledge distillation. Among these, post-training quantization (PTQ) is widely adopted for its efficiency: it requires no retraining and only a small calibration dataset, enabling low-cost deployment. Recent advances in post-training quantization have demonstrated that even methods near 4 bits can maintain most of the original model performance. However, 1-bit quantization remains particularly challenging. A common strategy in 1-bit quantization is to determine binary weights by matching the full-precision parameters, following a weight-driven criterion. This criterion is not directly aligned with the actual goal of quantization, which is to preserve the model's output behavior. A natural alternative is to adopt output-driven criteria that minimize discrepancies in model outputs on calibration data. Surprisingly, naive output-driven approaches often perform even worse in the 1-bit regime. In this paper, we show that this failure arises from two fundamental issues: error accumulation across layers and, more critically, anisotropic distortion of the representation space. Based on these insights, we propose a novel PTQ method for 1-bit LLMs that explicitly addresses these issues while maintaining computational efficiency. Extensive experiments demonstrate that our approach consistently outperforms existing 1-bit PTQ methods.
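To make the distinction between the two criteria concrete, here is a minimal NumPy sketch for a single linear layer. It is not the method proposed in the paper. It only contrasts a generic weight-driven objective (match W) with a generic output-driven objective (match XW^T on calibration inputs X). All function names, the sign-plus-scale binarization, and the random data are assumptions made for illustration.

```python
import numpy as np

def binarize_weight_driven(W):
    """Per-row 1-bit quantization W ~ alpha * B with B in {-1, +1}.

    For B = sign(W), the scale minimizing ||w - alpha * b||^2 for each row
    is the mean absolute value of that row.
    """
    B = np.where(W >= 0, 1.0, -1.0)
    alpha = np.abs(W).mean(axis=1, keepdims=True)
    return alpha, B

def rescale_output_driven(X, W, B):
    """Keep the signs B fixed and refit each row's scale on calibration data.

    Minimizes ||X w - alpha * X b||^2 per output row, a 1-D least-squares fit.
    """
    Y = X @ W.T  # full-precision layer outputs, shape (n_samples, n_out)
    Z = X @ B.T  # outputs of the unscaled binary weights
    alpha = (Y * Z).sum(axis=0) / (Z * Z).sum(axis=0)
    return alpha[:, None]

def weight_error(W, W_q):
    """Squared Frobenius distance between original and quantized weights."""
    return float(np.linalg.norm(W - W_q) ** 2)

def output_error(X, W, W_q):
    """Squared distance between layer outputs on the calibration inputs."""
    return float(np.linalg.norm(X @ W.T - X @ W_q.T) ** 2)

# Hypothetical calibration data with unequal feature scales (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64)) * rng.uniform(0.1, 3.0, size=64)
W = rng.normal(size=(32, 64))

alpha_w, B = binarize_weight_driven(W)
alpha_o = rescale_output_driven(X, W, B)

print("weight-driven scale: weight err", weight_error(W, alpha_w * B),
      "output err", output_error(X, W, alpha_w * B))
print("output-driven scale: weight err", weight_error(W, alpha_o * B),
      "output err", output_error(X, W, alpha_o * B))
```

Within one layer, the output-driven scale can only lower the output error on the calibration set, at the price of a larger weight error. The sketch does not model the cross-layer error accumulation or the anisotropic distortion that the abstract identifies as the reasons naive output-driven methods fail at 1 bit.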

Dung Anh Hoang, Cuong Pham, Cuong Nguyen, Trung Le, Jianfei Cai, Thanh-Toan Do • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity 10.94 | 3785 |
| Language Modeling | WikiText-2 | -- | 2320 |
| Language Modeling | C4 | Perplexity 13.15 | 1565 |
| Language Modeling | PTB | Perplexity 16.75 | 1234 |
| Zero-shot Question Answering | AveQA | Accuracy 57.7 | 25 |
| Language Modeling | C4 | Perplexity (LLaMA-2 7B/8B) 19.25 | 6 |
| Question Answering | QA Benchmarks Zero-shot (BoolQ, Lambada, Piqa, OPQA, Winogrande, ARC-E, ARC-C, Hellaswag) | BoolQ Accuracy 72.02 | 6 |
| Language Modeling | PTB | Perplexity (LLaMA-2 7/8B) 3.17e+3 | 6 |