FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
About
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code available at https://github.com/thunlp/FR-Spec.
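The core idea above can be sketched in a few lines: rank tokens by corpus frequency, keep only the top fraction, and let the draft model score that subset instead of the full vocabulary, so the LM-head matmul shrinks proportionally (verification by the target model still runs over the full vocabulary, which is what preserves the output distribution). The snippet below is a minimal NumPy illustration of this vocabulary-compression step, not the released implementation; the helper names (`build_freq_subset`, `draft_logits`) and the toy sizes are hypothetical.

```python
import numpy as np

def build_freq_subset(token_counts, keep_ratio=0.25):
    # Hypothetical helper: ids of the most frequent tokens.
    k = max(1, int(len(token_counts) * keep_ratio))
    return np.argsort(token_counts)[::-1][:k]

def draft_logits(hidden, lm_head_weight, subset_ids):
    """Draft-model logits restricted to the frequency-ranked subset.

    hidden:         (d,)  hidden state from the single draft layer
    lm_head_weight: (V, d) full LM head weight matrix
    subset_ids:     ids of the retained high-frequency tokens
    """
    # Slicing the LM head rows drops the matmul cost from O(V*d)
    # to O(|subset|*d), i.e. ~75% less for a 25% subset.
    sub_w = lm_head_weight[subset_ids]   # (k, d)
    return sub_w @ hidden                # (k,)

# Toy example: 12-token vocabulary, 4-dim hidden states.
rng = np.random.default_rng(0)
V, d = 12, 4
W = rng.normal(size=(V, d))
h = rng.normal(size=d)
counts = rng.integers(1, 100, size=V)

subset = build_freq_subset(counts, keep_ratio=0.25)  # keep top 25%
logits = draft_logits(h, W, subset)
best = subset[int(np.argmax(logits))]  # map back to a full-vocab token id
```

Because the subset logits are an exact slice of the full logits, any draft token proposed this way still refers to a real vocabulary id, and the verify step can accept or reject it exactly as in standard speculative sampling.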
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speculative Decoding | Spec-Bench | MT Score | 195.6 | 48 |
| Question Answering | QA | Speedup Factor | 2.02 | 17 |
| Language Model Decoding | Spec-Bench | Conv. Acc | 234.2 | 11 |
| Speculative Decoding Throughput | Spec-Bench | Throughput (Conv.) | 474 | 10 |
| Decoding | Multi-task Evaluation Suite Llama-3.2-1B (test) | MT Throughput (tokens/s) | 394.8 | 6 |
| Speculative Decoding | Spec-Bench OLMo 2 7B | Conversation Score | 4.72 | 5 |
| Code Generation | Code | Throughput (tokens/s) | 183.5 | 3 |
| Conversation | Conv. | Throughput (tokens/s) | 212.1 | 3 |
| Machine Translation | MT | Throughput (tokens/s) | 188.7 | 3 |
| Mathematical Reasoning | MATH | Throughput (tokens/s) | 238 | 3 |