
ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning

About

As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.
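To make the core idea concrete, here is a minimal, illustrative sketch of codebook-style weight compression: a weight matrix is split into small sub-vectors, the sub-vectors are clustered with k-means, and the matrix is then stored as a shared codebook plus per-group indices. This is plain NumPy and not the authors' actual ClusComp implementation; the function names, group size, and codebook size are hypothetical choices for the example.

```python
import numpy as np

def cluster_compress(W, n_clusters=64, group=4, iters=20, seed=0):
    """Compress weight matrix W by clustering its length-`group`
    sub-vectors into a shared codebook via k-means.
    Storage becomes: codebook (n_clusters x group floats) plus one
    index per group, i.e. roughly log2(n_clusters)/group bits/weight."""
    rng = np.random.default_rng(seed)
    vecs = W.reshape(-1, group)  # split W into sub-vectors
    # initialize centroids from randomly chosen sub-vectors
    codebook = vecs[rng.choice(len(vecs), n_clusters, replace=False)].copy()
    for _ in range(iters):
        # assign each sub-vector to its nearest centroid (squared L2)
        dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(1)
        # recompute each centroid as the mean of its assigned sub-vectors
        for k in range(n_clusters):
            members = vecs[idx == k]
            if len(members):
                codebook[k] = members.mean(0)
    return codebook, idx

def reconstruct(codebook, idx, shape):
    """Rebuild an approximate weight matrix from codebook + indices."""
    return codebook[idx].reshape(shape)

# Toy example: 64x64 Gaussian "weights", 64 codewords of 4 weights each
W = np.random.default_rng(1).normal(size=(64, 64)).astype(np.float32)
codebook, idx = cluster_compress(W)
W_hat = reconstruct(codebook, idx, W.shape)
err = np.abs(W - W_hat).mean()
```

With 64 codewords over groups of 4 weights, the indices cost about log2(64)/4 = 1.5 bits per weight plus the (small, shared) codebook, which is the regime the abstract's 1-4 bit results refer to. In the paper's paradigm, the codebook entries themselves remain continuous parameters, which is what makes block-by-block finetuning of the compressed model possible.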

Baohao Liao, Christian Herold, Seyyed Hadi Hashemi, Stefan Vasilev, Shahram Khadivi, Christof Monz · 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 9.7 | 1875 |
| Language Modeling | WikiText-2 (test) | PPL | 3.72 | 1541 |
| Commonsense Reasoning | HellaSwag | Accuracy | 79 | 1460 |
| Language Modeling | C4 | Perplexity | 13.6 | 1182 |
| Commonsense Reasoning | WinoGrande | Accuracy | 72.4 | 776 |
| Question Answering | ARC Challenge | Accuracy | 40.7 | 749 |
| Language Modeling | PTB | Perplexity | 17.6 | 650 |
| Commonsense Reasoning | PIQA | Accuracy | 79.9 | 647 |
| Language Modeling | C4 (val) | PPL | 5.86 | 392 |
| Question Answering | ARC Easy | Normalized Acc | 74.9 | 385 |

Showing 10 of 25 rows.
