SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
About
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We can execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and can reach 60% unstructured sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches. The code is available at: https://github.com/IST-DASLab/sparsegpt.
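To make the semi-structured patterns mentioned above concrete, here is a minimal sketch of what a 2:4 or 4:8 constraint means for one row of weights. It is not the SparseGPT algorithm: it uses plain magnitude pruning as a stand-in selection rule, and the names `nm_prune` and `sparsity` are hypothetical helpers introduced only for this illustration. It also assumes the convention that n:m means n of every m consecutive weights are set to zero; for 2:4 and 4:8 this gives 50% sparsity under either reading of the notation.

```python
def nm_prune(weights, n, m):
    """Zero the n smallest-magnitude entries in every group of m consecutive weights.

    Illustration only: magnitude is a stand-in criterion, not SparseGPT's method.
    """
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned = list(weights)
    for start in range(0, len(weights), m):
        group = range(start, start + m)
        # Indices of the n smallest-magnitude weights within this group.
        drop = sorted(group, key=lambda i: abs(weights[i]))[:n]
        for i in drop:
            pruned[i] = 0.0
    return pruned


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


row = [0.9, -0.8, 0.7, -0.6, 0.1, 0.05, -0.02, 0.01]

pruned_2_4 = nm_prune(row, 2, 4)  # two zeros in each block of four
pruned_4_8 = nm_prune(row, 4, 8)  # four zeros in each block of eight

print(pruned_2_4, sparsity(pruned_2_4))
print(pruned_4_8, sparsity(pruned_4_8))
```

Both results are 50% sparse, but the 4:8 pattern is less restrictive: in this toy row it can keep all four large weights in the first half, whereas 2:4 must drop two of them.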
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity 5.84 | 3785 |
| Language Modeling | WikiText-2 (test) | PPL 8.32 | 2333 |
| Language Modeling | WikiText-2 | Perplexity (PPL) 4.25 | 2320 |
| Object Hallucination Evaluation | POPE | Accuracy 88.21 | 2019 |
| Commonsense Reasoning | HellaSwag | Accuracy 52.7 | 1896 |
| Visual Question Answering | VizWiz | Accuracy 65.28 | 1820 |
| Language Modeling | C4 | Perplexity 8.22 | 1688 |
| Language Modeling | C4 | Perplexity 27.62 | 1565 |
| Visual Question Answering | TextVQA | -- | 1453 |
| Commonsense Reasoning | WinoGrande | Accuracy 78.85 | 1442 |