
Extreme Compression of Large Language Models via Additive Quantization

About

The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques that can enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression, defined as targeting extremely low bit counts (2 to 3 bits per parameter), from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval to advance the state of the art in LLM compression, via two innovations: 1) learned additive quantization of weight matrices in an input-adaptive fashion, and 2) joint optimization of codebook parameters across each transformer block. Broadly, AQLM is the first scheme that is Pareto-optimal in terms of accuracy vs. model size when compressing to less than 3 bits per parameter, and it significantly improves upon all known schemes in the extreme compression (2-bit) regime. In addition, AQLM is practical: we provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed, while executing within a much smaller memory footprint.
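To make the multi-codebook idea concrete, the sketch below shows the core reconstruction step of additive quantization: each group of weights is represented by one index into each of several codebooks, and the dequantized group is the sum of the selected codebook entries. All shapes and parameter values here (group size, number of codebooks, codebook size) are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

group_size = 8      # weights quantized jointly as one vector (assumed)
num_codebooks = 2   # M codebooks whose selected entries are summed
codebook_bits = 8   # 2**8 = 256 entries per codebook

# In AQ-style schemes the codebooks are learned; random ones stand in here.
codebooks = rng.standard_normal((num_codebooks, 2**codebook_bits, group_size))

def decode(codes, codebooks):
    """Reconstruct a weight group as the sum of one entry per codebook."""
    return sum(codebooks[m, codes[m]] for m in range(len(codes)))

# Storage cost: each group of 8 weights stores M * 8 = 16 bits of indices,
# i.e. 2 bits per parameter (ignoring the shared codebook overhead).
bits_per_param = num_codebooks * codebook_bits / group_size

codes = np.array([3, 141])        # one index into each codebook
w_hat = decode(codes, codebooks)  # approximate weight group, shape (8,)
```

Quantization then amounts to choosing, for each weight group, the combination of indices whose summed codewords best approximates the original weights; AQLM additionally makes this choice input-adaptive and optimizes codebooks jointly per transformer block.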

Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh • 2024

Related benchmarks

Task                    Dataset         Metric             Result   Rank
Language Modeling       WikiText-2      Perplexity         9.74     1875
Commonsense Reasoning   HellaSwag       Accuracy           63.69    1460
Language Modeling       C4              Perplexity         7.2      1182
Language Modeling       WikiText-2      Perplexity (PPL)   5.41     841
Commonsense Reasoning   WinoGrande      Accuracy           76.48    776
Language Understanding  MMLU            Accuracy           65.6     756
Question Answering      ARC Challenge   Accuracy           37.8     749
Commonsense Reasoning   PIQA            Accuracy           76.2     647
Question Answering      ARC Easy        Normalized Acc.    69.8     385
Video Understanding     VideoMME        --                 --       192
Showing 10 of 51 rows
