QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
About
Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes ($\le 4$ bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric $E_8$ lattice, which achieves the optimal 8-dimensional unit ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.
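To make the first two ideas concrete, here is a minimal pure-Python sketch, not the code from the repository above. It assumes the randomized Hadamard transform takes the common form $x \mapsto H S x / \sqrt{n}$, with $H$ a Sylvester Hadamard matrix and $S$ a random diagonal sign matrix, and it rounds to the plain, infinite $E_8$ lattice using the standard decomposition $E_8 = D_8 \cup (D_8 + \tfrac{1}{2})$. The paper's actual codebooks are finite and hardware-oriented, so the rounding step is only an illustration of the lattice they build on. All function names are hypothetical.

```python
import math
import random


def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    return y


def random_signs(n, rng):
    """Diagonal of S: n independent +/-1 entries (assumed form of the randomization)."""
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rht(x, signs):
    """Randomized Hadamard transform x -> H S x / sqrt(n); this map is orthogonal."""
    scale = 1.0 / math.sqrt(len(x))
    return [scale * v for v in fwht([s * v for s, v in zip(signs, x)])]


def rht_inverse(y, signs):
    """Inverse x = S H y / sqrt(n), using H symmetric and H H = n I."""
    scale = 1.0 / math.sqrt(len(y))
    return [s * scale * v for s, v in zip(signs, fwht(y))]


def nearest_d8(x):
    """Nearest point of D_n (integer vectors with even coordinate sum)."""
    r = [round(v) for v in x]
    if sum(r) % 2 != 0:
        # Parity is wrong: move the worst-rounded coordinate to its other neighbor.
        k = max(range(len(x)), key=lambda i: abs(x[i] - r[i]))
        r[k] += 1 if x[k] > r[k] else -1
    return r


def nearest_e8(x):
    """Nearest point of E_8 = D_8 union (D_8 + 1/2) to an 8-dimensional vector."""
    if len(x) != 8:
        raise ValueError("E_8 lives in 8 dimensions")
    a = nearest_d8(x)
    b = [v + 0.5 for v in nearest_d8([v - 0.5 for v in x])]
    dist_a = sum((p - q) ** 2 for p, q in zip(x, a))
    dist_b = sum((p - q) ** 2 for p, q in zip(x, b))
    return a if dist_a <= dist_b else b
```

Applying `rht` to a vector with a single large entry spreads that entry's magnitude evenly over all coordinates while preserving the Euclidean norm, which is the intuition behind incoherence processing. For a weight matrix, one would apply such a transform along both rows and columns, quantize 8-dimensional groups of the result, and undo the transform at inference time; those steps are omitted here.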
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 4.9 | 1875 |
| Language Modeling | WikiText-2 (test) | PPL | 2.99 | 1541 |
| Commonsense Reasoning | HellaSwag | Accuracy | 63.51 | 1460 |
| Language Modeling | C4 | Perplexity | 7.2 | 1182 |
| Language Modeling | WikiText-2 | Perplexity (PPL) | 5.35 | 841 |
| Commonsense Reasoning | WinoGrande | Accuracy | 76.8 | 776 |
| Question Answering | ARC Challenge | Accuracy | 39.5 | 749 |
| Commonsense Reasoning | PIQA | Accuracy | 78.45 | 647 |
| Language Modeling | C4 (val) | PPL | 5.96 | 392 |
| Question Answering | ARC Easy | Normalized Acc | 72.9 | 385 |