OstQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting

About

Post-training quantization (PTQ) has emerged as a widely adopted technique for compressing and accelerating Large Language Models (LLMs). The major challenge in LLM quantization is that uneven and heavy-tailed data distributions can expand the quantization range, thereby reducing bit precision for most values. Recent methods attempt to eliminate outliers and balance inter-channel differences by employing linear transformations; however, they remain heuristic and often overlook optimizing the data distribution across the entire quantization space. In this paper, we introduce Quantization Space Utilization Rate (QSUR), a novel metric that effectively assesses the quantizability of transformed data by measuring the space utilization of the data in the quantization space. We complement QSUR with mathematical derivations that examine the effects and limitations of various transformations, guiding our development of Orthogonal and Scaling Transformation-based Quantization (OSTQuant). OSTQuant employs a learnable equivalent transformation, consisting of an orthogonal transformation and a scaling transformation, to optimize the distributions of weights and activations across the entire quantization space. Furthermore, we propose the KL-Top loss function, designed to mitigate noise during optimization while retaining richer semantic information within the limited calibration data imposed by PTQ. OSTQuant outperforms existing work on various LLMs and benchmarks. In the W4-only setting, it retains 99.5% of the floating-point accuracy. In the more challenging W4A4KV4 configuration, OSTQuant reduces the performance gap by 32% on the LLaMA-3-8B model compared to state-of-the-art methods. Code: https://github.com/BrotherHappy/OSTQuant
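To make the idea of an equivalent transformation concrete, here is a minimal sketch in plain Python. It assumes a single linear layer y = xW and inserts a fixed orthogonal matrix Q (a 2x2 rotation) and a diagonal scaling S between activations and weights, so that (x Q S^-1)(S Q^T W) = xW. The helper names, the rotation angle, and the scale values are hypothetical and chosen only for illustration; in OSTQuant these transformations are learned, and this sketch does not reproduce the authors' implementation, QSUR, or the KL-Top loss.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def rotation(theta):
    """2x2 rotation matrix, a simple example of an orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def diag(values):
    """Square diagonal matrix built from a list of values."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def transform_pair(x, w, q, scales):
    """Return (x', w') with x' = x Q S^-1 and w' = S Q^T w, so x' w' equals x w."""
    s = diag(scales)
    s_inv = diag([1.0 / v for v in scales])
    x_t = matmul(matmul(x, q), s_inv)
    w_t = matmul(matmul(s, transpose(q)), w)
    return x_t, w_t

def fake_quantize(m, bits=4):
    """Symmetric per-tensor round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for row in m for v in row)
    step = max_abs / qmax if max_abs > 0 else 1.0
    return [[max(-qmax - 1, min(qmax, round(v / step))) * step for v in row] for row in m]

# Toy activations whose first channel carries much larger values (assumed data).
x = [[8.0, 0.1], [-7.5, 0.2], [9.0, -0.1]]
w = [[0.5, -0.3], [0.2, 0.7]]

q = rotation(math.pi / 4)   # hypothetical fixed rotation; learned in the paper
scales = [2.0, 0.5]         # hypothetical per-channel scales; learned in the paper

x_t, w_t = transform_pair(x, w, q, scales)
y_ref = matmul(x, w)
y_t = matmul(x_t, w_t)

# The transformed pair reproduces the original output up to floating-point error.
max_diff = max(abs(a - b) for ra, rb in zip(y_ref, y_t) for a, b in zip(ra, rb))
print("max output difference:", max_diff)
```

Because the full-precision output is unchanged, the transformation is free to reshape the distributions that the quantizer sees. Comparing `fake_quantize(x)` with `fake_quantize(x_t)` is one way to inspect that effect on toy data, though whether a particular fixed Q and S actually helps depends on the data, which is why the paper optimizes them.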

Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, Sifan Zhou • 2025

Related benchmarks

Task | Dataset | Result | Rank
Language Modeling | WikiText2 | Perplexity 3.19 | 3785
Language Modeling | WikiText-2 | Perplexity (PPL) 5.26 | 2320
Commonsense Reasoning | WinoGrande | Accuracy 65.8 | 1442
Question Answering | ARC Challenge | Accuracy (ARC) 54.03 | 598
Sentence Completion | HellaSwag | Accuracy 77.23 | 364
Zero-shot Reasoning | Reasoning Suite Zero-shot (PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c) (val test) | Average Accuracy 44.36 | 297
Question Answering | ARC Easy | Accuracy 75.84 | 210
Language Modeling | Perplexity | Perplexity (PPL) 7.28 | 149
Zero-shot Common Sense Reasoning | Common Sense Reasoning | Zero-shot Accuracy 64.92 | 137
Zero-shot Evaluation | Zero-shot Tasks Average | Accuracy 65.41 | 95
Showing 10 of 22 rows
