Model-Preserving Adaptive Rounding
About
The goal of quantization is to produce a compressed model whose output distribution is as close to the original model's as possible. To make this tractable, most quantization algorithms minimize each layer's immediate activation error as a proxy for the end-to-end error. However, this ignores the effect of subsequent layers, making it a poor proxy. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that directly accounts for the error at the network's output. YAQA builds on a series of theoretical results that culminate in the first end-to-end error bounds for quantization algorithms. First, we characterize the convergence time of adaptive rounding algorithms via the structure of their Hessian approximations. We then show that the end-to-end error can be bounded by the approximation's cosine similarity to the true Hessian. This admits a natural Kronecker-factored approximation with corresponding near-optimal Hessian sketches. YAQA is provably better than GPTQ/LDLQ and empirically reduces the error by $\approx 30\%$ relative to these methods. YAQA even achieves lower error than quantization-aware training. This translates to state-of-the-art performance on downstream tasks, all while adding no inference overhead.
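To illustrate the Kronecker-factored Hessian idea, here is a minimal sketch (not YAQA's actual implementation) using the standard Van Loan-Pitsianis construction: the best Frobenius-norm approximation of a matrix by a Kronecker product reduces to a rank-1 approximation of a rearranged matrix. The cosine-similarity measure from the abstract is then just the normalized Frobenius inner product between the approximation and the true Hessian. All function and variable names below are illustrative.

```python
import numpy as np

def nearest_kronecker(H, m, n):
    """Best Frobenius-norm approximation H ~ kron(A, B), with A (m x m) and
    B (n x n). Uses the Van Loan-Pitsianis rearrangement: rows of R index
    entries of A, columns index entries of B, so the nearest Kronecker
    product is the best rank-1 approximation of R."""
    assert H.shape == (m * n, m * n)
    # R[i*m + j, k*n + l] = H[i*n + k, j*n + l]; for H = kron(A, B) this
    # equals A[i, j] * B[k, l], i.e. R = vec(A) vec(B)^T is rank 1.
    R = H.reshape(m, n, m, n).transpose(0, 2, 1, 3).reshape(m * m, n * n)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m, m)
    B = np.sqrt(s[0]) * Vt[0].reshape(n, n)
    return A, B

def hessian_cosine(H, H_hat):
    """Cosine similarity <H, H_hat>_F / (||H||_F ||H_hat||_F)."""
    return np.sum(H * H_hat) / (np.linalg.norm(H) * np.linalg.norm(H_hat))

rng = np.random.default_rng(0)
m, n = 4, 8
# A toy "true" Hessian: an exact Kronecker product of two PSD factors,
# plus a small perturbation that breaks the exact structure.
A0 = rng.standard_normal((m, m)); A0 = A0 @ A0.T
B0 = rng.standard_normal((n, n)); B0 = B0 @ B0.T
H = np.kron(A0, B0) + 0.01 * rng.standard_normal((m * n, m * n))

A, B = nearest_kronecker(H, m, n)
print(hessian_cosine(H, np.kron(A, B)))  # close to 1 for near-Kronecker H
```

The appeal of the factored form is size: storing $A$ and $B$ costs $m^2 + n^2$ entries instead of $m^2 n^2$ for the full Hessian, which is what makes Hessian-aware rounding feasible at LLM scale.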
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 3.18 | 1875 |
| Language Modeling | WikiText-2 (test) | PPL | 2.99 | 1541 |
| Language Modeling | C4 | Perplexity | 5.02 | 1182 |
| Language Modeling | C4 (val) | PPL | 5.96 | 392 |
| Language Modeling | C4 (test) | Perplexity | 5.02 | 268 |
| Zero-shot Evaluation | Zero-shot Evaluation Suite | ARC-C | 50.2 | 14 |