DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

About

Speculative decoding accelerates LLM inference by letting a small drafter propose multiple tokens, which a large target model verifies once per speculation step. As vocabularies scale past 10^5 tokens, verification cost in the target model is largely unchanged, but the drafter can become bottlenecked by its O(|V|d) output projection. Recent approaches (e.g., FR-Spec, VocabTrim) mitigate this by restricting drafting to a fixed, frequency-ranked shortlist; however, such static truncation is corpus-dependent and suppresses rare or domain-specific tokens, reducing acceptance and limiting speedups. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism for large-vocabulary speculative decoding. DynaSpec trains lightweight meta-classifiers that route each context to a small set of coarse token clusters; the union of the top-selected clusters defines the drafter's shortlist, while the target model still verifies over the full vocabulary, preserving exactness. Systems-wise, routing is overlapped with draft computation via parallel execution streams, reducing end-to-end overhead. Across standard speculative decoding benchmarks, DynaSpec consistently improves mean accepted length, recovering 98.4% of full-vocabulary performance for Llama-3-8B versus 93.6% for fixed-shortlist baselines, and achieves up to a 2.23x throughput gain, compared to 1.91x for static approaches, on the dataset with rare tokens.
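The shortlisting step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy cluster assignment, and the hand-written router scores are all assumptions made for illustration. In DynaSpec the scores would come from a trained meta-classifier, and the target model would still verify each drafted token over the full vocabulary, which is not shown here.

```python
from typing import Dict, List, Sequence, Tuple


def top_k_clusters(router_scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring clusters."""
    order = sorted(range(len(router_scores)),
                   key=lambda c: router_scores[c], reverse=True)
    return order[:k]


def build_shortlist(cluster_ids: Sequence[int],
                    cluster_to_tokens: Dict[int, List[int]]) -> List[int]:
    """Union of the token ids in the selected clusters (sorted for determinism)."""
    shortlist = set()
    for c in cluster_ids:
        shortlist.update(cluster_to_tokens[c])
    return sorted(shortlist)


def draft_logits(hidden: Sequence[float],
                 lm_head: Sequence[Sequence[float]],
                 shortlist: Sequence[int]) -> Dict[int, float]:
    """Project onto shortlisted rows only: O(|S| d) work instead of O(|V| d)."""
    return {t: sum(h * w for h, w in zip(hidden, lm_head[t]))
            for t in shortlist}


def draft_token(hidden: Sequence[float],
                lm_head: Sequence[Sequence[float]],
                router_scores: Sequence[float],
                cluster_to_tokens: Dict[int, List[int]],
                k: int) -> Tuple[int, List[int]]:
    """Greedy draft step restricted to a context-dependent shortlist."""
    clusters = top_k_clusters(router_scores, k)
    shortlist = build_shortlist(clusters, cluster_to_tokens)
    logits = draft_logits(hidden, lm_head, shortlist)
    return max(logits, key=logits.get), shortlist


# Toy setup (all values hypothetical): |V| = 6 tokens, d = 2, 3 clusters.
LM_HEAD = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0],
           [0.0, 2.0], [3.0, 3.0], [-1.0, -1.0]]
CLUSTER_TO_TOKENS = {0: [0, 1], 1: [2, 3], 2: [4, 5]}
HIDDEN = [1.0, 0.5]
ROUTER_SCORES = [0.1, 0.7, 0.2]  # stand-in for the meta-classifier output

token, shortlist = draft_token(HIDDEN, LM_HEAD, ROUTER_SCORES,
                               CLUSTER_TO_TOKENS, k=2)
print(token, shortlist)
```

With k=2 the router keeps clusters 1 and 2, so the drafter scores four of the six tokens; with k=1 it scores only two. A static shortlist would fix that token set once for every context, whereas here it changes with the router scores.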

Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar • 2025

Related benchmarks

Task	Dataset	Result
Speculative Decoding	SpecBench	AVG SR 799.2	47
Speculative Decoding	SpecBench and HumanEval	Throughput (tokens/s) 378.1	5
Speculative Decoding	SpecBench Qwen-2-7B-Instruct (test)	Overall Mean Score 3.46	5
