SlimGPT: Layer-wise Structured Pruning for Large Language Models

About

Large language models (LLMs) have garnered significant attention for their remarkable capabilities across various domains, but their vast parameter scales present challenges for practical deployment. Structured pruning is an effective method to balance model performance with efficiency, but restoring performance under computational resource constraints is a principal challenge in pruning LLMs. We therefore present SlimGPT, a low-cost and fast structured pruning method for LLMs based on the Optimal Brain Surgeon framework. We propose Batched Greedy Pruning for rapid and near-optimal pruning, which enhances the accuracy of head-wise pruning error estimation through grouped Cholesky decomposition and improves the pruning efficiency of the FFN via Dynamic Group Size, thereby achieving approximately locally optimal pruning results within one hour. In addition, we explore the limitations of layer-wise pruning from the perspective of error accumulation and propose Incremental Pruning Ratio, a non-uniform pruning strategy that reduces performance degradation. Experimental results on the LLaMA benchmark show that SlimGPT outperforms other methods and achieves state-of-the-art results.
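To make the Optimal Brain Surgeon (OBS) idea concrete, here is a minimal NumPy sketch of the classic layer-wise OBS step: remove one input column of a weight matrix and adjust the remaining columns to compensate. It is not SlimGPT's Batched Greedy Pruning, and it uses neither grouped Cholesky decomposition nor Dynamic Group Size. The function name `obs_prune_column`, the toy shapes, and the random calibration data are all assumptions made for illustration.

```python
import numpy as np

def obs_prune_column(W, H_inv, q):
    """Zero input column q of W and compensate the other columns (OBS step).

    W     : (out_features, in_features) weight matrix
    H_inv : inverse of the layer-wise Hessian H = X @ X.T
    q     : index of the input column to remove
    """
    d = H_inv[q, q]
    # Predicted increase in squared output error caused by removing column q.
    err = float(np.sum(W[:, q] ** 2) / d)
    # Row-wise compensation: delta_r = -(W[r, q] / d) * H_inv[q, :].
    W_new = W - np.outer(W[:, q] / d, H_inv[q, :])
    W_new[:, q] = 0.0  # already zero up to rounding; set it exactly
    return W_new, err

# Toy calibration data (hypothetical sizes, random values).
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 256))   # (in_features, n_samples)
W = rng.standard_normal((4, 8))     # (out_features, in_features)
H_inv = np.linalg.inv(X @ X.T)      # full rank here, so no damping is needed

W_pruned, predicted = obs_prune_column(W, H_inv, q=3)
# Measured squared output error on the calibration inputs.
actual = float(np.sum((W @ X - W_pruned @ X) ** 2))
```

On the calibration inputs, `actual` should match `predicted` up to floating-point error. Structured methods score and remove whole groups of columns (for example, an attention head) rather than a single column as shown here.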

Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu • 2024

Related benchmarks

Task	Dataset	Metric	Result
Language Modeling	WikiText2	Perplexity	7.69	3785
Language Modeling	WikiText-2 (test)	PPL	17.73	2333
Language Modeling	WikiText-2	Perplexity (PPL)	16.68	2320
Commonsense Reasoning	HellaSwag	Accuracy	47.81	1896
Language Modeling	C4	Perplexity	11.41	1688
Language Modeling	PTB	Perplexity	37.8	1234
Question Answering	ARC Challenge	--	--	906
Multi-task Language Understanding	MMLU	--	--	881
Question Answering	ARC Easy	Accuracy	58.29	597
Question Answering	PIQA	Accuracy	71.33	505

Showing 10 of 28 rows
