
ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning

About

As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.
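The abstract's core idea is to replace groups of weights with indices into a small learned codebook. As an illustration only, the sketch below clusters fixed-size weight chunks with plain k-means; the function names (`cluster_compress`, `reconstruct`) and parameters (`group`, `n_codes`) are assumptions for this example, not the paper's exact procedure, which additionally finetunes the codebooks block-by-block.

```python
import numpy as np

def cluster_compress(W, group=4, n_codes=16, iters=10, seed=0):
    """Cluster the length-`group` chunks of W into a shared codebook
    (illustrative Lloyd's k-means; hypothetical, not the paper's method)."""
    rng = np.random.default_rng(seed)
    flat = W.reshape(-1, group)                  # split W into sub-vectors
    # initialize the codebook with randomly chosen sub-vectors
    codebook = flat[rng.choice(len(flat), n_codes, replace=False)].copy()
    for _ in range(iters):
        # assign each sub-vector to its nearest code (squared distance)
        d = ((flat[:, None, :] - codebook[None]) ** 2).sum(-1)
        idx = d.argmin(1)
        # update each code to the mean of its assigned sub-vectors
        for k in range(n_codes):
            mask = idx == k
            if mask.any():
                codebook[k] = flat[mask].mean(0)
    return codebook, idx.reshape(W.shape[0], -1)

def reconstruct(codebook, idx, group=4):
    """Rebuild an approximate weight matrix from codebook indices."""
    return codebook[idx].reshape(idx.shape[0], -1)
```

Under these assumptions, storing one 8-bit index per group of four 16-bit weights costs roughly 2 bits per weight plus the small codebook, which is how codebook clustering can reach the low effective bit widths the abstract describes.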

Baohao Liao, Christian Herold, Seyyed Hadi Hashemi, Stefan Vasilev, Shahram Khadivi, Christof Monz • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Language Modeling | WikiText2 | Perplexity | 9.7 | 2839 |
| Language Modeling | WikiText-2 (test) | Perplexity | 3.72 | 1949 |
| Commonsense Reasoning | HellaSwag | Accuracy | 79.0 | 1891 |
| Language Modeling | C4 | Perplexity | 13.6 | 1422 |
| Commonsense Reasoning | WinoGrande | Accuracy | 72.4 | 1085 |
| Language Modeling | PTB | Perplexity | 17.6 | 1034 |
| Question Answering | ARC Challenge | Accuracy | 40.7 | 906 |
| Commonsense Reasoning | PIQA | Accuracy | 79.9 | 751 |
| Language Modeling | C4 (val) | Perplexity | 5.86 | 514 |
| Question Answering | ARC-E | Accuracy | 78.9 | 416 |
Showing 10 of 25 rows
