NanoSpec: Accelerating Speculative Decoding using Minimalist In-Context Vocabularies

About

The massive vocabulary sizes of large language models, often exceeding 100k tokens, impose a computational bottleneck on the final linear projection layer during speculative decoding. Existing vocabulary pruning solutions rely on static or coarse-grained sub-vocabularies that necessitate large active sizes ($\sim$30k) to maintain draft quality. We propose NanoSpec, a novel training-free approach that breaks this trade-off by dynamically constructing a minimalist, context-aware active vocabulary for each generation step. Leveraging the inherent temporal locality of language generation, NanoSpec achieves high coverage while slashing the average vocabulary size by over $40\times$ (to $<$3k tokens) without requiring any auxiliary trained parameters. To realize the theoretical benefits of such high sparsity on modern hardware, we introduce a system-algorithm co-design that overcomes the inefficiencies of sparse memory access through asynchronous gathering and GPU-resident state management. As a complementary plug-and-play module, NanoSpec cuts draft time by an average of 51.6\%, delivering a $1.17$-$1.29\times$ end-to-end speedup over the state-of-the-art speculative decoding methods EAGLE-2 and EAGLE-3 across 7 tasks and outperforming complex training-based pruning baselines.
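The general idea of restricting the draft model's final projection to a small, context-derived vocabulary can be illustrated with a minimal NumPy sketch. This is not NanoSpec's actual construction rule or its GPU implementation, neither of which is specified in the abstract. The function names, the "tokens seen in context plus a small always-on set" heuristic, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np

def active_vocab_from_context(context_ids, always_on_ids):
    """Hypothetical rule: the active vocabulary is the set of tokens already
    seen in the context plus a small fixed set. Returns sorted unique ids."""
    return np.union1d(
        np.asarray(context_ids, dtype=np.int64),
        np.asarray(always_on_ids, dtype=np.int64),
    )

def draft_logits_sparse(hidden, lm_head, active_ids):
    """Compute logits only for the active rows of the LM head.

    hidden:     (d,)   final hidden state of the draft model
    lm_head:    (V, d) full output projection matrix
    active_ids: (k,)   token ids in the active vocabulary, k << V
    """
    sub_head = lm_head[active_ids]  # gather k rows -> (k, d)
    return sub_head @ hidden        # (k,) instead of (V,)

def draft_next_token(hidden, lm_head, active_ids):
    """Greedy pick within the active vocabulary, mapped back to a full-vocab id."""
    logits = draft_logits_sparse(hidden, lm_head, active_ids)
    return int(active_ids[int(np.argmax(logits))])

# Toy setup (assumed sizes, far smaller than a real model).
rng = np.random.default_rng(0)
V, d = 1000, 16
lm_head = rng.standard_normal((V, d))
hidden = rng.standard_normal(d)

context_ids = [5, 17, 17, 42, 999]  # tokens generated or seen so far
always_on_ids = [0, 1, 2]           # e.g. special tokens (assumption)

active_ids = active_vocab_from_context(context_ids, always_on_ids)
token = draft_next_token(hidden, lm_head, active_ids)
full_logits = lm_head @ hidden      # dense reference for comparison
```

The sparse logits equal the corresponding entries of the dense projection, so the draft distribution over the active set is unchanged; the saving comes from multiplying by `k` rows rather than `V`. In speculative decoding, the target model still verifies every drafted token, so a token missing from the active set costs acceptance rate rather than output correctness.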

Zhiyang Chen, Daliang Xu, Yinyuan Zhang, Chenghua Wang, Mengwei Xu, Yun Ma • 2026

Related benchmarks

Task	Dataset	Result
Speculative Decoding	SpecBench	AVG SR900.7	47
Speculative Decoding	SpecBench and HumanEval	Throughput (tokens/s): 392.7	5
Speculative Decoding	SpecBench Qwen-2-7B-Instruct (test)	Overall Mean Score: 3.48	5

