Prompt-based Depth Pruning of Large Language Models
About
Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent -- a block that is crucial for one task can be removed without degrading accuracy on another. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has itself been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models and achieves better on-task performance than static depth pruning baselines.
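The routing idea above can be sketched in a few lines: a lightweight classifier maps a prompt embedding to one of several precomputed omission sets, and the chosen blocks are skipped at inference time. This is a minimal illustrative sketch, not the paper's implementation; the names (`Router`, `OMISSION_SETS`, the toy "blocks") and the random linear router are assumptions for demonstration.

```python
import numpy as np

NUM_BLOCKS = 8          # total transformer blocks in a toy model
OMISSION_SETS = [       # candidate sets of block indices to skip
    {2, 5},             # (constructed in a data-driven manner in the paper;
    {3, 6},             #  these particular sets are made up for illustration)
    {1, 7},
]

class Router:
    """Lightweight router: prompt embedding -> index of best omission set."""
    def __init__(self, embed_dim, num_options, seed=0):
        rng = np.random.default_rng(seed)
        # A trained linear head in practice; random weights here as a stand-in.
        self.W = rng.standard_normal((num_options, embed_dim)) * 0.01

    def __call__(self, prompt_embedding):
        scores = self.W @ prompt_embedding
        return int(np.argmax(scores))

def forward(blocks, x, omitted):
    """Run the model, skipping the depth-pruned block indices."""
    for i, block in enumerate(blocks):
        if i in omitted:
            continue        # omitted block acts as an identity shortcut
        x = block(x)
    return x

# Toy "transformer blocks": each just scales the hidden state.
blocks = [(lambda x, s=1.0 + 0.1 * i: s * x) for i in range(NUM_BLOCKS)]

router = Router(embed_dim=4, num_options=len(OMISSION_SETS))
prompt_embedding = np.ones(4)           # stand-in for an embedded prompt
choice = router(prompt_embedding)       # pick an omission set per prompt
omitted = OMISSION_SETS[choice]
out = forward(blocks, np.ones(3), omitted)
```

Because the omission set is chosen per prompt, different tasks can keep different blocks, which is what allows the method to outperform a single static pruning choice.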
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | HellaSwag | -- | 1891 |
| Commonsense Reasoning | WinoGrande | -- | 1085 |
| Natural Language Inference | RTE | Accuracy: 53.8 | 448 |
| Question Answering | ARC-E | Accuracy: 27.3 | 416 |
| Question Answering | BoolQ | -- | 317 |
| Question Answering | ARC-C | Accuracy: 23.7 | 192 |
| Recognizing Textual Entailment | RTE | Accuracy: 64.3 | 47 |
| Natural Language Understanding | NLP Suite (BoolQ, RTE, HellaSwag, WinoGrande, ARC-E, ARC-C, OpenBookQA), zero-shot | Average Accuracy: 48.8 | 41 |
| Science Question Answering | ARC Easy | Accuracy (Character-level): 51.3 | 20 |
| Science Question Answering | ARC Challenge | Accuracy (ARC): 32.1 | 19 |