Model-Preserving Adaptive Rounding
About
The goal of quantization is to produce a compressed model whose output distribution is as close to the original model's as possible. To make this tractable, most quantization algorithms minimize each layer's immediate activation error as a proxy for the end-to-end error. However, this ignores the effect of subsequent layers, making it a poor proxy. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that directly accounts for the error at the network's output. YAQA builds on a series of theoretical results that culminate in the first end-to-end error bounds for quantization algorithms. First, we characterize the convergence time of adaptive rounding algorithms via the structure of their Hessian approximations. We then show that the end-to-end error can be bounded by the approximation's cosine similarity to the true Hessian. This admits a natural Kronecker-factored approximation with corresponding near-optimal Hessian sketches. YAQA is provably better than GPTQ/LDLQ and empirically reduces the error by $\approx 30\%$ relative to these methods. YAQA even achieves lower error than quantization-aware training. This translates to state-of-the-art performance on downstream tasks, all while adding no inference overhead.
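To illustrate the Kronecker-factored Hessian idea, here is a minimal sketch (not YAQA's actual implementation) using the standard Van Loan-Pitsianis construction: the best Frobenius-norm approximation of a matrix by a Kronecker product reduces to a rank-1 approximation of a rearranged matrix. The cosine-similarity measure from the abstract is then just the normalized Frobenius inner product between the approximation and the true Hessian. All function and variable names below are illustrative.

```python
import numpy as np

def nearest_kronecker(H, m, n):
    """Best Frobenius-norm approximation H ~ kron(A, B), with A (m x m) and
    B (n x n). Uses the Van Loan-Pitsianis rearrangement: rows of R index
    entries of A, columns index entries of B, so the nearest Kronecker
    product is the best rank-1 approximation of R."""
    assert H.shape == (m * n, m * n)
    # R[i*m + j, k*n + l] = H[i*n + k, j*n + l]; for H = kron(A, B) this
    # equals A[i, j] * B[k, l], i.e. R = vec(A) vec(B)^T is rank 1.
    R = H.reshape(m, n, m, n).transpose(0, 2, 1, 3).reshape(m * m, n * n)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m, m)
    B = np.sqrt(s[0]) * Vt[0].reshape(n, n)
    return A, B

def hessian_cosine(H, H_hat):
    """Cosine similarity <H, H_hat>_F / (||H||_F ||H_hat||_F)."""
    return np.sum(H * H_hat) / (np.linalg.norm(H) * np.linalg.norm(H_hat))

rng = np.random.default_rng(0)
m, n = 4, 8
# A toy "true" Hessian: an exact Kronecker product of two PSD factors,
# plus a small perturbation that breaks the exact structure.
A0 = rng.standard_normal((m, m)); A0 = A0 @ A0.T
B0 = rng.standard_normal((n, n)); B0 = B0 @ B0.T
H = np.kron(A0, B0) + 0.01 * rng.standard_normal((m * n, m * n))

A, B = nearest_kronecker(H, m, n)
print(hessian_cosine(H, np.kron(A, B)))  # close to 1 for near-Kronecker H
```

The appeal of the factored form is size: storing $A$ and $B$ costs $m^2 + n^2$ entries instead of $m^2 n^2$ for the full Hessian, which is what makes Hessian-aware rounding feasible at LLM scale.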
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 3.18 | 1875 |
| Language Modeling | WikiText-2 (test) | PPL | 2.99 | 1541 |
| Language Modeling | C4 | Perplexity | 5.02 | 1182 |
| Language Modeling | C4 (val) | PPL | 5.96 | 392 |
| Language Modeling | C4 (test) | Perplexity | 5.02 | 268 |
| Zero-shot Evaluation | Zero-shot Evaluation Suite | ARC-C | 50.2 | 14 |