Prompt-based Depth Pruning of Large Language Models
About
Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent -- a block that is crucial for one task can be removed without degrading accuracy on another. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has itself been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models and achieves better on-task performance than static depth pruning baselines.
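The routing idea above can be sketched in a few lines: a lightweight classifier maps a prompt embedding to one of several precomputed omission sets, and the chosen blocks are skipped at inference time. This is a minimal illustrative sketch, not the paper's implementation; the names (`Router`, `OMISSION_SETS`, the toy "blocks") and the random linear router are assumptions for demonstration.

```python
import numpy as np

NUM_BLOCKS = 8          # total transformer blocks in a toy model
OMISSION_SETS = [       # candidate sets of block indices to skip
    {2, 5},             # (constructed in a data-driven manner in the paper;
    {3, 6},             #  these particular sets are made up for illustration)
    {1, 7},
]

class Router:
    """Lightweight router: prompt embedding -> index of best omission set."""
    def __init__(self, embed_dim, num_options, seed=0):
        rng = np.random.default_rng(seed)
        # A trained linear head in practice; random weights here as a stand-in.
        self.W = rng.standard_normal((num_options, embed_dim)) * 0.01

    def __call__(self, prompt_embedding):
        scores = self.W @ prompt_embedding
        return int(np.argmax(scores))

def forward(blocks, x, omitted):
    """Run the model, skipping the depth-pruned block indices."""
    for i, block in enumerate(blocks):
        if i in omitted:
            continue        # omitted block acts as an identity shortcut
        x = block(x)
    return x

# Toy "transformer blocks": each just scales the hidden state.
blocks = [(lambda x, s=1.0 + 0.1 * i: s * x) for i in range(NUM_BLOCKS)]

router = Router(embed_dim=4, num_options=len(OMISSION_SETS))
prompt_embedding = np.ones(4)           # stand-in for an embedded prompt
choice = router(prompt_embedding)       # pick an omission set per prompt
omitted = OMISSION_SETS[choice]
out = forward(blocks, np.ones(3), omitted)
```

Because the omission set is chosen per prompt, different tasks can keep different blocks, which is what allows the method to outperform a single static pruning choice.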
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | HellaSwag | -- | 1891 |
| Commonsense Reasoning | WinoGrande | -- | 1085 |
| Natural Language Inference | RTE | Accuracy: 53.8 | 448 |
| Question Answering | ARC-E | Accuracy: 27.3 | 416 |
| Question Answering | BoolQ | -- | 317 |
| Question Answering | ARC-C | Accuracy: 23.7 | 192 |
| Recognizing Textual Entailment | RTE | Accuracy: 64.3 | 47 |
| Natural Language Understanding | NLP Suite (BoolQ, RTE, HellaSwag, WinoGrande, ARC-E, ARC-C, OpenBookQA), zero-shot | Average Accuracy: 48.8 | 41 |
| Science Question Answering | ARC Easy | Accuracy (Character-level): 51.3 | 20 |
| Science Question Answering | ARC Challenge | Accuracy (ARC): 32.1 | 19 |