ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models

About

The impressive performance of Large Language Models (LLMs) across various natural language processing tasks comes at the cost of vast computational resources and storage requirements. One-shot pruning techniques offer a way to alleviate these burdens by removing redundant weights without the need for retraining. Yet, the massive scale of LLMs often forces current pruning approaches to rely on heuristics instead of optimization-based techniques, potentially resulting in suboptimal compression. In this paper, we introduce ALPS, an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step. Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency. ALPS substantially outperforms state-of-the-art methods in terms of the pruning objective and perplexity reduction, particularly for highly sparse models. On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
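The operator-splitting idea mentioned above can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a layer-wise reconstruction objective, min ||XW - XW0||_F^2 subject to W having at most k nonzeros, which the abstract does not spell out, and it applies a textbook ADMM split (W = D) to that objective. The post-processing step here refits the surviving weights with a direct linear solve, swapped in for the preconditioned conjugate gradient method the abstract describes. All function names (`project_topk`, `admm_prune`, `refit_on_support`, `recon_error`) and the values of `rho` and `iters` are hypothetical.

```python
import numpy as np

def project_topk(M, k):
    """Keep the k largest-magnitude entries of M (1 <= k), zero the rest."""
    flat = M.reshape(-1)
    if k >= flat.size:
        return M.copy()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros(flat.size, dtype=M.dtype)
    out[idx] = flat[idx]
    return out.reshape(M.shape)

def admm_prune(X, W0, k, rho=1.0, iters=100):
    """Textbook ADMM on 0.5*||XW - XW0||_F^2 with W = D and ||D||_0 <= k.

    Assumed objective and hyperparameters; a sketch, not ALPS itself.
    """
    H = X.T @ X                       # Gram matrix of the layer inputs
    G = H @ W0
    A = H + rho * np.eye(H.shape[0])
    D = project_topk(W0, k)           # start from magnitude pruning
    V = np.zeros_like(W0)             # dual variable for the constraint W = D
    for _ in range(iters):
        W = np.linalg.solve(A, G + rho * D - V)   # quadratic W-update
        D = project_topk(W + V / rho, k)          # projection onto sparse set
        V = V + rho * (W - D)                     # dual ascent step
    return D

def refit_on_support(X, W0, D):
    """Re-solve the least-squares problem with the support of D held fixed.

    Uses a direct solve per column in place of preconditioned conjugate gradient.
    """
    H = X.T @ X
    G = H @ W0
    W = np.zeros_like(D)
    for j in range(D.shape[1]):
        S = np.flatnonzero(D[:, j])
        if S.size:
            W[S, j] = np.linalg.solve(H[np.ix_(S, S)], G[S, j])
    return W

def recon_error(X, W0, W):
    """Squared Frobenius norm of the layer-output mismatch."""
    return float(np.linalg.norm(X @ W - X @ W0) ** 2)

# Toy usage: prune a random 16x8 layer to 70% sparsity (keep 30% of weights).
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16))
W0 = rng.standard_normal((16, 8))
k = int(0.3 * W0.size)
D = admm_prune(X, W0, k)
W = refit_on_support(X, W0, D)
print("kept weights:", np.count_nonzero(W), "of", W0.size)
print("error before refit:", recon_error(X, W0, D))
print("error after refit: ", recon_error(X, W0, W))
```

Refitting on a fixed support cannot increase the reconstruction error, because it solves the same least-squares problem restricted to the weights that were kept. At real model sizes the direct solves would be too expensive, which is presumably why an iterative solver is used instead.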

Xiang Meng, Kayhan Behdin, Haoyue Wang, Rahul Mazumder • 2024

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	HellaSwag	Accuracy 53.37	1896
Language Modeling	C4	Perplexity 7.99	1688
Question Answering	ARC Challenge	Accuracy 40.61	906
Question Answering	ARC Easy	Accuracy 72.9	597
Natural Language Inference	RTE	Accuracy 57.76	590
Question Answering	BoolQ	Accuracy 75.44	317
Language Modeling	Wiki	Perplexity (PPL) 5.9	298
Question Answering	OpenBookQA	Accuracy 30.8	145
Commonsense Reasoning	WinoGrande	Accuracy 68.98	68
Zero-shot Accuracy	ARC Easy	Zero-shot Acc (ARC Easy) 68.86	67

Showing 10 of 23 rows
