
Compressing Large Language Models using Low Rank and Low Precision Decomposition

About

The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces $\rm CALDERA$ -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$, and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. $\rm CALDERA$ obtains this decomposition by formulating it as an optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$, where $\mathbf{X}$ is the calibration data, and $\mathbf{Q}, \mathbf{L}, \mathbf{R}$ are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of $\rm CALDERA$ are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that compressing LLaMa-$2$ $7$B/$13$B/$70$B and LLaMa-$3$ $8$B models using $\rm CALDERA$ outperforms existing post-training LLM compression techniques in the regime of less than $2.5$ bits per parameter. The implementation is available at: https://github.com/pilancilab/caldera.
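To make the objective concrete, here is a minimal sketch of a $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition via alternating minimization: quantize the residual to get $\mathbf{Q}$, then fit a rank-$k$ factorization $\mathbf{L}\mathbf{R}$ to what remains via SVD. This is not the CALDERA algorithm itself -- it uses a simple uniform quantizer, minimizes the unweighted error $\lVert\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W}\rVert_{\rm F}^2$ rather than the calibration-weighted objective, and leaves $\mathbf{L}$ and $\mathbf{R}$ in full precision; the function names are illustrative, not from the repository.

```python
import numpy as np

def quantize_uniform(m, bits=2):
    # Round each entry to the nearest of 2**bits uniformly spaced levels
    # spanning [m.min(), m.max()] (a simplified stand-in for the
    # low-precision quantizers used in practice).
    levels = 2 ** bits
    lo = m.min()
    step = (m.max() - lo) / (levels - 1) + 1e-12
    return lo + np.round((m - lo) / step) * step

def qlr_decompose(W, rank=8, bits=2, iters=10):
    """Alternating minimization for W ~ Q + L @ R (unweighted sketch)."""
    L = np.zeros((W.shape[0], rank))
    R = np.zeros((rank, W.shape[1]))
    for _ in range(iters):
        # Q-step: quantize the part of W not explained by the low-rank term.
        Q = quantize_uniform(W - L @ R, bits)
        # LR-step: best rank-k approximation of the quantization residual.
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L = U[:, :rank] * s[:rank]
        R = Vt[:rank]
    return Q, L, R
```

With a random Gaussian matrix, the combined $\mathbf{Q} + \mathbf{L}\mathbf{R}$ approximation attains a lower Frobenius error than quantization alone, illustrating why the low-rank correction helps at aggressive bit budgets.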

Rajarshi Saha, Naomi Sagan, Varun Srivastava, Andrea J. Goldsmith, Mert Pilanci • 2024

Related benchmarks

Task                       | Dataset              | Metric         | Result | Rank
---------------------------|----------------------|----------------|--------|-----
Language Modeling          | WikiText2            | Perplexity     | 5.55   | 1875
Language Modeling          | WikiText-2 (test)    | PPL            | 5.71   | 1541
Language Modeling          | C4                   | Perplexity     | 9.56   | 1182
Question Answering         | ARC Challenge        | --             | --     | 749
Commonsense Reasoning      | PIQA                 | Accuracy       | 79.92  | 647
Question Answering         | ARC Easy             | Normalized Acc | 75.21  | 385
Natural Language Inference | RTE                  | Accuracy       | 87     | 367
Language Modeling          | WikiText2 v1 (test)  | Perplexity     | 3.98   | 341
Language Modeling          | C4 (test)            | Perplexity     | 8.44   | 268
Commonsense Reasoning      | WinoGrande           | Accuracy       | 89.19  | 156

(Showing 10 of 14 rows.)
