Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

About

Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called calibration data) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both within and across tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive for large models and datasets, while ZipCal is on average ~240× faster due to its tractable linear complexity. We make the code and the experiments available at https://github.com/FrancescoMonaco/ZipCal.
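To make the idea of frequency-based, model-agnostic selection concrete, here is a minimal Python sketch. It is not the ZipCal algorithm from the paper; it is only an illustration under stated assumptions. It assumes whitespace tokenization, scores each document by the summed self-information of its distinct words under corpus-wide frequencies, and keeps the top k. The names tokenize and select_calibration_set are hypothetical. Counting and scoring take one pass over the tokens, so the cost is linear in corpus size apart from the final sort over documents, and no model forward pass is needed.

```python
from collections import Counter
import math


def tokenize(text):
    # Assumption: lowercase whitespace splitting stands in for a real tokenizer.
    return text.lower().split()


def select_calibration_set(corpus, k):
    """Return the k documents whose distinct words are rarest in the corpus.

    Illustrative heuristic only; not the method described in the paper.
    """
    tokenized = [tokenize(doc) for doc in corpus]

    # Single pass over all tokens to get corpus-wide word frequencies.
    freq = Counter(tok for doc in tokenized for tok in doc)
    total = sum(freq.values())

    def score(doc):
        # Sum of -log p(word) over distinct words: rare word types count more,
        # and repeating a frequent word adds nothing.
        return sum(-math.log(freq[tok] / total) for tok in set(doc))

    # Rank document indices by score, highest first (ties keep corpus order).
    ranked = sorted(range(len(corpus)),
                    key=lambda i: score(tokenized[i]),
                    reverse=True)
    return [corpus[i] for i in ranked[:k]]


if __name__ == "__main__":
    docs = [
        "the the the the",
        "the cat sat",
        "quantum pruning sparsity calibration",
        "the the cat",
    ]
    print(select_calibration_set(docs, 2))
```

In this toy corpus the document made entirely of words that occur once ranks first, while the one that only repeats the most frequent word ranks last. A perplexity-based selector would instead need a model evaluation per candidate document, which is the cost the abstract contrasts against.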

Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca • 2026

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	HellaSwag	Accuracy 71.86	1896
Language Modeling	C4	Perplexity 10.04	1688
Commonsense Reasoning	WinoGrande	Accuracy 68.82	1442
Language Modeling	Wiki	Perplexity (PPL) 7.21	298
Language Modeling	The Pile	Perplexity 4.72	129
Natural Language Inference	aNLI	Accuracy 47.58	107
Language Modeling	Wiki, C4, and Pile	Average Perplexity 7.35	52
Language Understanding	MMLU-M	Accuracy 27.4	29
Boolean Question Answering	BoolQ	Accuracy 89.08	29
Multi-task Language Understanding	MMLU-M	Accuracy 28.79	26

Showing 10 of 23 rows
