
QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

About

Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing their weights to low-precision. In this work, we introduce QuIP#, a weight-only PTQ method that achieves state-of-the-art results in extreme compression regimes ($\le$ 4 bits per weight) using three novel techniques. First, QuIP# improves QuIP's (Chee et al., 2023) incoherence processing by using the randomized Hadamard transform, which is faster and has better theoretical properties. Second, QuIP# uses vector quantization to take advantage of the ball-shaped sub-Gaussian distribution that incoherent weights possess: specifically, we introduce a set of hardware-efficient codebooks based on the highly symmetric $E_8$ lattice, which achieves the optimal 8-dimension unit ball packing. Third, QuIP# uses fine-tuning to improve fidelity to the original model. Our experiments show that QuIP# outperforms existing PTQ methods, enables new behaviors in PTQ scaling, and supports fast inference. Our code can be found at https://github.com/Cornell-RelaxML/quip-sharp.
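The randomized Hadamard transform used for incoherence processing can be sketched in a few lines: an orthonormal Hadamard matrix with random sign flips is applied on both sides of the weight matrix, spreading out large entries while preserving the Frobenius norm. This is a minimal NumPy illustration of the idea only, not the repository's implementation; the function names are ours.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])

def randomized_hadamard_transform(W, rng):
    """Incoherence-process W as (H_m S_m) W (S_n H_n^T), with random sign diagonals S."""
    m, n = W.shape
    Hm, Hn = hadamard(m), hadamard(n)
    Sm = rng.choice([-1.0, 1.0], size=m)  # random sign flips make the transform randomized
    Sn = rng.choice([-1.0, 1.0], size=n)
    # Hm * Sm == Hm @ diag(Sm); both factors are orthogonal, so the norm of W is preserved
    return (Hm * Sm) @ W @ (Hn * Sn).T
```

Because both factors are orthogonal, the transform is exactly invertible, so quantization error introduced afterwards is the only source of loss.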

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa • 2024

Related benchmarks

| Task                  | Dataset            | Metric           | Result | Rank |
|-----------------------|--------------------|------------------|--------|------|
| Language Modeling     | WikiText2          | Perplexity       | 4.9    | 1875 |
| Language Modeling     | WikiText-2 (test)  | PPL              | 2.99   | 1541 |
| Commonsense Reasoning | HellaSwag          | Accuracy         | 63.51  | 1460 |
| Language Modeling     | C4                 | Perplexity       | 7.2    | 1182 |
| Language Modeling     | WikiText-2         | Perplexity (PPL) | 5.35   | 841  |
| Commonsense Reasoning | WinoGrande         | Accuracy         | 76.8   | 776  |
| Question Answering    | ARC Challenge      | Accuracy         | 39.5   | 749  |
| Commonsense Reasoning | PIQA               | Accuracy         | 78.45  | 647  |
| Language Modeling     | C4 (val)           | PPL              | 5.96   | 392  |
| Question Answering    | ARC Easy           | Normalized Acc   | 72.9   | 385  |
Showing 10 of 48 rows
