
ZipLM: Inference-Aware Structured Pruning of Language Models

About

The breakthrough performance of large language models (LLMs) comes with major computational footprints and high deployment costs. In this paper, we progress towards resolving this problem by proposing a novel structured compression approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art accuracy-vs-speedup, while matching a set of desired target runtime speedups in any given inference environment. Specifically, given a model, a dataset, an inference environment, as well as a set of speedup targets, ZipLM iteratively identifies and removes components with the worst loss-runtime trade-off. Unlike prior methods that specialize in either the post-training/one-shot or the gradual compression setting, and only for specific families of models such as BERT (encoder) or GPT (decoder), ZipLM produces state-of-the-art compressed models across all these settings. Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications. In particular, ZipLM outperforms all prior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and TinyBERT. Moreover, it matches the performance of the heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large model. When compressing GPT2, ZipLM outperforms DistilGPT2 while being 60% smaller and 30% faster. Our code is available at: https://github.com/IST-DASLab/ZipLM.
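The core loop described above — repeatedly removing the component with the worst loss-runtime trade-off until a runtime target is met — can be sketched as follows. This is a hypothetical simplification, not the paper's implementation: component names, the scoring rule, and the runtime model are illustrative assumptions (the real method scores structured components such as attention heads and FFN blocks using measured loss increases and profiled latencies in the target inference environment).

```python
# Hypothetical sketch of an inference-aware structured pruning loop.
# Each component is (name, loss_increase, runtime_ms); both values
# would come from profiling in the actual method.

def prune_to_speedup(components, target_speedup):
    """Greedily remove components with the least loss increase per
    millisecond of runtime saved, until the total runtime of the
    remaining components meets the requested speedup target."""
    total_runtime = sum(c[2] for c in components)
    target_runtime = total_runtime / target_speedup
    kept = list(components)
    removed = []
    current = total_runtime
    while current > target_runtime and len(kept) > 1:
        # Worst loss-runtime trade-off: smallest loss increase
        # per unit of runtime saved by removing the component.
        victim = min(kept, key=lambda c: c[1] / c[2])
        kept.remove(victim)
        removed.append(victim[0])
        current -= victim[2]
    return kept, removed

# Toy components (illustrative numbers only).
components = [
    ("head_0", 0.01, 2.0),  # cheap to remove: tiny loss increase
    ("head_1", 0.50, 2.0),  # important: large loss increase
    ("ffn_0", 0.05, 6.0),   # big runtime saving, modest loss
    ("ffn_1", 0.40, 6.0),
]
kept, removed = prune_to_speedup(components, target_speedup=2.0)
print(removed)  # -> ['head_0', 'ffn_0']
```

Running the loop for several increasing speedup targets would yield the "family of smaller, faster" models the abstract refers to, each guaranteed to meet its runtime specification by construction.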

Eldar Kurtic, Elias Frantar, Dan Alistarh • 2023

Related benchmarks

Task | Dataset | Result | Rank
Language Understanding | MMLU | Accuracy 38.08 | 756
Language Modeling | WikiText-103 (test) | Perplexity 35.4 | 524
Natural Language Understanding | GLUE (dev) | SST-2 (Acc) 91.7 | 504
Natural Language Understanding | GLUE (test) | SST-2 Accuracy 91.8 | 416
Question Answering | SQuAD v1.1 (dev) | F1 Score 85.7 | 375
Language Modeling | WikiText2 2016 (test) | Perplexity 3.95 | 88
Zero-shot Downstream Task Evaluation | ARC-c, ARC-e, WinoGrande, BoolQ, HellaSwag, OpenBookQA, PIQA, MMLU standard (test val) | Average Accuracy 0.7054 | 88
Accuracy | LLaMA2-7B zero-shot | -- | 16
Language Modeling | Language Modeling Dataset | PPL 9.17 | 13
Zero-shot Learning | 7 Downstream Tasks Avg | Average Score 56.88 | 4

Other info

Code: https://github.com/IST-DASLab/ZipLM
