EvoPress: Accurate Dynamic Model Compression via Evolutionary Search

About

The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on estimating the importance of a given layer, implicitly assuming that layers contribute independently to the overall compression error. We begin from the motivating observation that this independence assumption does not generally hold for LLM compression: pruning a model further may even significantly recover performance. To address this, we propose EvoPress, a novel evolutionary framework for dynamic LLM compression. By formulating dynamic compression as a general optimization problem, EvoPress identifies optimal compression profiles in a highly efficient manner, and generalizes across diverse models and compression techniques. Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths. Our code is available at https://github.com/IST-DASLab/EvoPress.
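
To make the idea of searching over per-layer compression profiles under a global budget concrete, here is a minimal sketch in Python. It is not the EvoPress algorithm, whose details are not given in this abstract; it is a generic elitist evolutionary loop under stated assumptions. The names `mutate`, `toy_loss`, and `evolve` are hypothetical, and `toy_loss` is a made-up stand-in for a real calibration loss, with a coupling term between adjacent layers so that layers do not contribute independently.

```python
import random


def mutate(profile, levels, rng):
    """Move one compression step from layer j to layer i.

    The sum of the profile is unchanged, so the global budget still holds.
    """
    child = list(profile)
    i, j = rng.sample(range(len(child)), 2)
    # Apply the swap only if both layers stay inside the valid level range.
    if child[i] + 1 < levels and child[j] - 1 >= 0:
        child[i] += 1
        child[j] -= 1
    return child


def toy_loss(profile):
    """Hypothetical loss, NOT a real model evaluation.

    Later layers are assumed more sensitive, and adjacent layers interact.
    """
    per_layer = sum((k + 1) * lvl ** 2 for k, lvl in enumerate(profile))
    interaction = sum(a * b for a, b in zip(profile, profile[1:]))
    return per_layer + interaction


def evolve(num_layers=8, levels=5, target_level=2,
           generations=200, offspring=16, seed=0):
    rng = random.Random(seed)
    # Start from the uniform profile, which meets the budget by construction.
    parent = [target_level] * num_layers
    best = toy_loss(parent)
    for _ in range(generations):
        children = [mutate(parent, levels, rng) for _ in range(offspring)]
        candidate = min(children, key=toy_loss)
        cand_loss = toy_loss(candidate)
        # Elitist selection: keep the parent unless a child is at least as good.
        if cand_loss <= best:
            parent, best = candidate, cand_loss
    return parent, best


profile, loss = evolve()
print(profile, loss)
```

In this sketch the budget is enforced by the mutation operator rather than by a penalty term, so every candidate evaluated is feasible. In a real setting, `toy_loss` would be replaced by an evaluation of the compressed model on calibration data.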

Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh • 2024

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText-2 (test)	PPL 5.48	2333
Language Modeling	WikiText-2	Perplexity (PPL) 7.64	2320
Language Modeling	C4	Perplexity 17.58	1688
Language Modeling	C4	Perplexity 12.53	1565
Language Modeling	C4 (val)	PPL 7.65	737
Language Modeling	WikiText2 (val)	Perplexity (PPL) 5.42	423
Language Modeling	WikiText2 v1 (test)	Perplexity 5.74	383
Language Modeling	Wiki	Perplexity (PPL) 28.76	298
Zero-shot Reasoning	Reasoning Suite Zero-shot (PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c) (val test)	Average Accuracy 64.67	297
Language Modeling	WikiText2	Perplexity 86.94	277

Showing 10 of 14 rows
