QuIP: 2-Bit Quantization of Large Language Models With Guarantees

About

This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices, i.e., from the weights being even in magnitude and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. We complement QuIP with the first theoretical analysis for an LLM-scale quantization algorithm, and show that our theory also applies to an existing method, OPTQ. Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. Our code can be found at https://github.com/Cornell-RelaxML/QuIP.
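To make the two-step pipeline concrete, below is a minimal numpy sketch of the incoherence-processing idea, under stated assumptions: the helper names (`random_orthogonal`, `nearest_round`, `quip_round`) are hypothetical, the dense Haar-random orthogonal matrices stand in for the structured rotations the actual implementation uses for efficiency, and nearest rounding is only a placeholder for QuIP's adaptive rounding step, which also consumes the Hessian.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Haar-random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix so the distribution is uniform

def nearest_round(W, grid):
    """Placeholder quantizer: round each entry to the nearest grid point.
    (QuIP's actual step (1) is an adaptive rounding procedure that uses H.)"""
    idx = np.abs(W[..., None] - grid).argmin(axis=-1)
    return grid[idx]

def quip_round(W, H, grid, seed=0):
    """Sketch of QuIP step (2): incoherence pre- and post-processing.

    Quantizes W (shape m x n) under the quadratic proxy objective
    tr((What - W) H (What - W)^T), whose value the rotations preserve.
    """
    rng = np.random.default_rng(seed)
    m, n = W.shape
    U = random_orthogonal(m, rng)
    V = random_orthogonal(n, rng)
    W_inc = U @ W @ V.T   # incoherent weights: entries become even in magnitude
    H_inc = V @ H @ V.T   # incoherent Hessian: eigenvectors unaligned with axes
    # A real adaptive rounding step would use H_inc; nearest rounding ignores it.
    W_hat_inc = nearest_round(W_inc, grid)
    return U.T @ W_hat_inc @ V  # post-processing: undo the rotations
```

For two bits per weight one would pass a four-point grid, e.g. `quip_round(W, H, np.linspace(-1, 1, 4))`. Because the rotations cancel inside the proxy objective, rounding in the incoherent basis and rotating back leaves the objective's value unchanged while making the weights far easier to round accurately.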

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa · 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 3.45 | 2839 |
| Language Modeling | WikiText-2 (test) | PPL | 5.01 | 1949 |
| Language Modeling | WikiText-2 | Perplexity (PPL) | 5.94 | 1624 |
| Language Modeling | C4 | Perplexity | 5.6 | 1422 |
| Mathematical Reasoning | GSM8K | Accuracy | 0.00 | 1362 |
| Language Modeling | C4 | Perplexity | 11.03 | 1071 |
| Code Generation | HumanEval | Pass@1 | 0.00 | 1036 |
| Language Modeling | PTB | Perplexity | 13.4 | 1034 |
| Multi-task Language Understanding | MMLU | -- | -- | 876 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 77.97 | 572 |

Showing 10 of 33 rows.

Other info

Code

https://github.com/Cornell-RelaxML/QuIP
