
Model-Preserving Adaptive Rounding

About

The goal of quantization is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, most quantization algorithms minimize the immediate activation error of each layer as a proxy for the end-to-end error. However, this ignores the effect of future layers, making it a poor proxy. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that directly considers the error at the network's output. YAQA builds on a series of theoretical results that culminate in the first end-to-end error bounds for quantization algorithms. First, we characterize the convergence time of adaptive rounding algorithms via the structure of their Hessian approximations. We then show that the end-to-end error can be bounded by the approximation's cosine similarity to the true Hessian. This admits a natural Kronecker-factored approximation with corresponding near-optimal Hessian sketches. YAQA is provably better than GPTQ/LDLQ and empirically reduces the error by $\approx 30\%$ over these methods. YAQA even achieves lower error than quantization-aware training. This translates to state-of-the-art performance on downstream tasks, all while adding no inference overhead.
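To make the layer-wise baseline concrete, below is a minimal sketch of GPTQ/LDLQ-style adaptive rounding: columns are quantized one at a time, and each column's rounding error is propagated to the not-yet-quantized columns through the inverse of a layer-wise proxy Hessian $H = \mathbb{E}[xx^\top]$. The function name and the diagonal-loading constant are illustrative assumptions, not the paper's implementation; YAQA's contribution is to replace this layer-wise $H$ with a Kronecker-factored approximation of the full end-to-end Hessian.

```python
import numpy as np

def adaptive_round(W, H, grid):
    """Greedy layer-wise adaptive rounding (GPTQ/LDLQ-style sketch).

    W:    (rows, cols) weight matrix of a linear layer acting as W @ x.
    H:    (cols, cols) PSD proxy Hessian, e.g. E[x x^T] over calibration data.
    grid: 1-D array of representable quantized values.
    Returns a quantized matrix Q whose entries all lie on `grid`.
    """
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    # Diagonal loading for numerical stability (illustrative constant).
    Hinv = np.linalg.inv(H + 1e-8 * np.eye(H.shape[0]))
    for j in range(W.shape[1]):
        # Round column j to the nearest grid point.
        q = grid[np.argmin(np.abs(W[:, [j]] - grid[None, :]), axis=1)]
        Q[:, j] = q
        # Propagate the rounding error to the remaining columns so they
        # can compensate -- this is what distinguishes adaptive rounding
        # from independent nearest-rounding of each weight.
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q
```

Because the error feedback uses only the layer's own input statistics, this objective ignores how later layers amplify or cancel the introduced error, which is the gap the abstract's end-to-end analysis addresses.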

Albert Tseng, Zhaofeng Sun, Christopher De Sa • 2025

Related benchmarks

| Task                 | Dataset                    | Result          | Rank |
|----------------------|----------------------------|-----------------|------|
| Language Modeling    | WikiText2                  | Perplexity 3.18 | 1875 |
| Language Modeling    | WikiText-2 (test)          | PPL 2.99        | 1541 |
| Language Modeling    | C4                         | Perplexity 5.02 | 1182 |
| Language Modeling    | C4 (val)                   | PPL 5.96        | 392  |
| Language Modeling    | C4 (test)                  | Perplexity 5.02 | 268  |
| Zero-shot Evaluation | Zero-shot Evaluation Suite | ARCC 50.2       | 14   |
