
KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem

About

Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while navigating the shifting bottlenecks of real-world hardware. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across various benchmarks. Our plug-and-play approach ensures high-speed inference for long sequences without requiring additional training or compromising the target model's output distribution.
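To make the knapsack framing concrete, here is a minimal illustrative sketch (not the authors' implementation; all names and numbers are hypothetical): each skippable layer is an item whose "value" is the draft latency it saves and whose "weight" is a proxy for the acceptance-rate loss it incurs (e.g. derived from hidden-state cosine similarity), and a 0/1 knapsack dynamic program picks the subset to skip under a total loss budget.

```python
# Illustrative sketch of draft-layer selection as a 0/1 knapsack.
# NOT the KnapSpec implementation; layer costs and budgets are made up.
#   latency_saved[i]: time saved (e.g. ms) by skipping layer i
#   importance[i]:    integer "acceptance-loss" units consumed by skipping it
#   budget:           maximum total acceptance loss allowed for the draft

def select_layers_to_skip(latency_saved, importance, budget):
    """Return (max total latency saved, set of layer indices to skip)."""
    n = len(latency_saved)
    # dp[b] = (best saving achievable with loss budget b, chosen layer set)
    dp = [(0.0, frozenset()) for _ in range(budget + 1)]
    for i in range(n):
        # Iterate budgets downward so each layer is used at most once.
        for b in range(budget, importance[i] - 1, -1):
            cand = dp[b - importance[i]][0] + latency_saved[i]
            if cand > dp[b][0]:
                dp[b] = (cand, dp[b - importance[i]][1] | {i})
    return dp[budget]

# Example with 6 layers and invented per-layer savings/importances.
saving = [3.0, 2.5, 4.0, 1.5, 2.0, 3.5]   # ms saved if layer is skipped
imp    = [2,   1,   3,   1,   2,   3]     # acceptance-loss units
best, skipped = select_layers_to_skip(saving, imp, budget=5)
```

In KnapSpec the per-layer costs are not static: Attention and MLP latencies are modeled as functions of context length, so the item values above would be recomputed (and the DP re-solved, in parallel) as the context grows.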

Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han • 2026

Related benchmarks

Task | Dataset | Metric | Value | Rank
Long-context Generation (Reasoning) | AIME24 | TPT | 46.2 | 20
Long-context Generation (Reasoning) | AIME25 | TPT | 49.96 | 20
Long-context Input (Summarization) | GovReport | Time Per Token (TPT) | 13 | 20
Long-context Input (Summarization) | PG19 | TPT | 10.67 | 20
Long-context Input (Summarization) | BookSum | TPT (s) | 6.99 | 20
Long-context Generation (Reasoning) | MMLU-Pro | TPT | 34.62 | 20
