
KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem

About

Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while navigating the shifting bottlenecks of real-world hardware. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across various benchmarks. Our plug-and-play approach ensures high-speed inference for long sequences without requiring additional training or compromising the target model's output distribution.
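To make the knapsack framing concrete, here is a minimal illustrative sketch (not the authors' implementation; all names and numbers are hypothetical): each skippable layer is an item whose "value" is the draft latency it saves and whose "weight" is a proxy for the acceptance-rate loss it incurs (e.g. derived from hidden-state cosine similarity), and a 0/1 knapsack dynamic program picks the subset to skip under a total loss budget.

```python
# Illustrative sketch of draft-layer selection as a 0/1 knapsack.
# NOT the KnapSpec implementation; layer costs and budgets are made up.
#   latency_saved[i]: time saved (e.g. ms) by skipping layer i
#   importance[i]:    integer "acceptance-loss" units consumed by skipping it
#   budget:           maximum total acceptance loss allowed for the draft

def select_layers_to_skip(latency_saved, importance, budget):
    """Return (max total latency saved, set of layer indices to skip)."""
    n = len(latency_saved)
    # dp[b] = (best saving achievable with loss budget b, chosen layer set)
    dp = [(0.0, frozenset()) for _ in range(budget + 1)]
    for i in range(n):
        # Iterate budgets downward so each layer is used at most once.
        for b in range(budget, importance[i] - 1, -1):
            cand = dp[b - importance[i]][0] + latency_saved[i]
            if cand > dp[b][0]:
                dp[b] = (cand, dp[b - importance[i]][1] | {i})
    return dp[budget]

# Example with 6 layers and invented per-layer savings/importances.
saving = [3.0, 2.5, 4.0, 1.5, 2.0, 3.5]   # ms saved if layer is skipped
imp    = [2,   1,   3,   1,   2,   3]     # acceptance-loss units
best, skipped = select_layers_to_skip(saving, imp, budget=5)
```

In KnapSpec the per-layer costs are not static: Attention and MLP latencies are modeled as functions of context length, so the item values above would be recomputed (and the DP re-solved, in parallel) as the context grows.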

Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han • 2026

Related benchmarks

Task | Dataset | Metric | Value | Rank
Long-context Generation (Reasoning) | AIME24 | TPT | 46.2 | 20
Long-context Generation (Reasoning) | AIME25 | TPT | 49.96 | 20
Long-context Input (Summarization) | GovReport | Time Per Token (TPT) | 13 | 20
Long-context Input (Summarization) | PG19 | TPT | 10.67 | 20
Long-context Input (Summarization) | BookSum | TPT (s) | 6.99 | 20
Long-context Generation (Reasoning) | MMLU-Pro | TPT | 34.62 | 20
