Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing
About
We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model's output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of each weight channel in maintaining performance. In the final stage, full inference is conducted on the remaining weights. A major advantage of PP is its compatibility with existing models, as it operates without requiring additional neural network modules or fine-tuning. Comprehensive evaluations of PP on LLaMA-2/3 and OPT models reveal that even minimal probing (using just 1.5% of FLOPs) can substantially enhance the efficiency of structured pruning of LLMs. For instance, when evaluated on LLaMA-2-7B with WikiText2, PP achieves a 2.56 times lower ratio of performance degradation per unit of runtime reduction compared to the state-of-the-art method at a 40% pruning ratio. Our code is available at https://github.com/Qi-Le1/Probe_Pruning.
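The three stages above can be sketched in simplified form. This is a minimal illustration, not the paper's implementation: the token-selection heuristic (residual L2 norm), the history-blending weight `alpha`, and the `|W| @ activation` channel score are all hypothetical stand-ins for the paper's residual-importance criterion and PP importance score.

```python
import numpy as np

def select_probe_tokens(hidden, k):
    """Stage 1 (probing): keep the k hidden states with the largest
    L2 norm -- a hypothetical proxy for residual importance."""
    norms = np.linalg.norm(hidden, axis=-1)
    return hidden[np.argsort(norms)[-k:]]

def channel_importance(probe_states, weight, history, alpha=0.5):
    """Stage 2 (history-informed pruning): blend probe activation
    statistics with historical statistics, then score each output
    channel (row) of `weight` -- a stand-in for the PP importance score."""
    act = np.abs(probe_states).mean(axis=0)      # per-input-feature magnitude
    blended = alpha * act + (1 - alpha) * history
    return np.abs(weight) @ blended              # one score per output channel

def prune_channels(weight, scores, ratio):
    """Structurally zero out the lowest-scoring fraction of channels;
    stage 3 (full inference) then runs on the remaining weights."""
    drop = np.argsort(scores)[: int(len(scores) * ratio)]
    pruned = weight.copy()
    pruned[drop, :] = 0.0
    return pruned, drop

rng = np.random.default_rng(0)
hidden = rng.normal(size=(64, 16))   # flattened batch of hidden states
weight = rng.normal(size=(32, 16))   # one linear layer to prune
history = np.ones(16)                # running activation statistics

probe = select_probe_tokens(hidden, k=8)           # probe a small slice of the batch
scores = channel_importance(probe, weight, history)
pruned_w, dropped = prune_channels(weight, scores, ratio=0.4)
print(len(dropped), "of", weight.shape[0], "channels pruned")
```

Because the pruning decision is recomputed per batch from the probe, different batches can drop different channels, which is what makes the scheme dynamic rather than a one-shot static prune.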
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText | PPL | 3.61 | 479 |
| Commonsense Reasoning | Common Sense Reasoning Tasks | Avg Score | 63.34 | 241 |
| Commonsense Reasoning | Commonsense Reasoning | Accuracy | 63.34 | 44 |
| Question Answering | 7 QA tasks | Accuracy | 66.89 | 42 |
| Commonsense Reasoning | Commonsense Reasoning Suite (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, OBQA) | Average Accuracy | 58.66 | 37 |
| Text Generation | 5 Generation tasks | Accuracy | 41.83 | 36 |
| Text Generation | Text Generation | PPL | 15.31 | 33 |
| Throughput Measurement | LLaMA-2 13B | Throughput (tokens/s) | 19.2 | 20 |
| Language Modeling | Perplexity Evaluation (zero-shot) | PPL (zero-shot) | 12.52 | 17 |
| Commonsense Reasoning | Commonsense Reasoning Benchmarks (zero-shot, LLaMA-2-13B) | BoolQ Accuracy (zero-shot) | 76.29 | 17 |