
ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning

About

As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.
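The abstract's core idea is to replace groups of weights with indices into a small learned codebook. As an illustration only, the sketch below clusters fixed-size weight chunks with plain k-means; the function names (`cluster_compress`, `reconstruct`) and parameters (`group`, `n_codes`) are assumptions for this example, not the paper's exact procedure, which additionally finetunes the codebooks block-by-block.

```python
import numpy as np

def cluster_compress(W, group=4, n_codes=16, iters=10, seed=0):
    """Cluster the length-`group` chunks of W into a shared codebook
    (illustrative Lloyd's k-means; hypothetical, not the paper's method)."""
    rng = np.random.default_rng(seed)
    flat = W.reshape(-1, group)                  # split W into sub-vectors
    # initialize the codebook with randomly chosen sub-vectors
    codebook = flat[rng.choice(len(flat), n_codes, replace=False)].copy()
    for _ in range(iters):
        # assign each sub-vector to its nearest code (squared distance)
        d = ((flat[:, None, :] - codebook[None]) ** 2).sum(-1)
        idx = d.argmin(1)
        # update each code to the mean of its assigned sub-vectors
        for k in range(n_codes):
            mask = idx == k
            if mask.any():
                codebook[k] = flat[mask].mean(0)
    return codebook, idx.reshape(W.shape[0], -1)

def reconstruct(codebook, idx, group=4):
    """Rebuild an approximate weight matrix from codebook indices."""
    return codebook[idx].reshape(idx.shape[0], -1)
```

Under these assumptions, storing one 8-bit index per group of four 16-bit weights costs roughly 2 bits per weight plus the small codebook, which is how codebook clustering can reach the low effective bit widths the abstract describes.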

Baohao Liao, Christian Herold, Seyyed Hadi Hashemi, Stefan Vasilev, Shahram Khadivi, Christof Monz • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Language Modeling | WikiText2 | Perplexity | 9.7 | 2839 |
| Language Modeling | WikiText-2 (test) | Perplexity | 3.72 | 1949 |
| Commonsense Reasoning | HellaSwag | Accuracy | 79.0 | 1891 |
| Language Modeling | C4 | Perplexity | 13.6 | 1422 |
| Commonsense Reasoning | WinoGrande | Accuracy | 72.4 | 1085 |
| Language Modeling | PTB | Perplexity | 17.6 | 1034 |
| Question Answering | ARC Challenge | Accuracy | 40.7 | 906 |
| Commonsense Reasoning | PIQA | Accuracy | 79.9 | 751 |
| Language Modeling | C4 (val) | Perplexity | 5.86 | 514 |
| Question Answering | ARC-E | Accuracy | 78.9 | 416 |
Showing 10 of 25 rows
