FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

About

Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by using a draft-then-verify mechanism to produce multiple tokens per forward pass. State-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model, which achieves impressive layer compression. However, their efficiency gains are substantially reduced for large-vocabulary LLMs such as Llama-3-8B, whose vocabulary has 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM head computation overhead by 75% while ensuring that the final output distribution is unchanged. Experiments across multiple datasets show an average speedup of 1.12× over the state-of-the-art speculative sampling method EAGLE-2. Code is available at https://github.com/thunlp/FR-Spec.
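The core idea described above can be illustrated with a toy, pure-Python sketch. It is not the FR-Spec implementation: all function names are hypothetical, the token frequencies are assumed to come from some sample corpus, and the 75% figure is assumed to correspond to keeping a quarter of the vocabulary rows in the draft LM head. Verification is shown only for greedy decoding, a simplification; the abstract's distribution-equivalence claim concerns speculative sampling in general, where the target model still scores the full vocabulary.

```python
from collections import Counter
from typing import List, Sequence

Vector = List[float]


def frequency_ranked_subset(corpus_tokens: Sequence[int], vocab_size: int,
                            keep_ratio: float = 0.25) -> List[int]:
    """Hypothetical helper: ids of the most frequent tokens, keeping keep_ratio of the vocab."""
    k = max(1, int(vocab_size * keep_ratio))
    counts = Counter(corpus_tokens)
    # Rank every id by corpus frequency (descending); ties go to the smaller id.
    ranked = sorted(range(vocab_size), key=lambda t: (-counts[t], t))
    return ranked[:k]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def restricted_lm_head_argmax(hidden: Vector, lm_head: List[Vector],
                              subset: List[int]) -> int:
    """Draft side: score only the subset rows of the LM head, return a full-vocab id."""
    return max(subset, key=lambda t: dot(lm_head[t], hidden))


def full_lm_head_argmax(hidden: Vector, lm_head: List[Vector]) -> int:
    """Target side: score every row of the LM head."""
    return max(range(len(lm_head)), key=lambda t: dot(lm_head[t], hidden))


def greedy_verify(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Accept drafts while they match the target's greedy choices.

    target_tokens[i] is the target model's greedy token at draft position i,
    plus one extra (bonus) token after the last draft position.
    """
    assert len(target_tokens) == len(draft_tokens) + 1
    out: List[int] = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            out.append(t)  # the target's token replaces the first rejected draft
            return out
        out.append(d)
    out.append(target_tokens[-1])  # all drafts accepted: keep the bonus token
    return out


def lm_head_macs(num_rows: int, hidden_dim: int) -> int:
    """Multiply-accumulates for one LM head projection of a single hidden state."""
    return num_rows * hidden_dim


if __name__ == "__main__":
    corpus = [3, 3, 3, 1, 1, 5, 0, 3, 1, 7]
    subset = frequency_ranked_subset(corpus, vocab_size=8, keep_ratio=0.25)
    print(subset)  # [3, 1]

    lm_head = [[float(t), 1.0] for t in range(8)]  # toy 8 x 2 LM head
    hidden = [1.0, 0.5]
    print(restricted_lm_head_argmax(hidden, lm_head, subset))  # 3
    print(full_lm_head_argmax(hidden, lm_head))  # 7

    print(greedy_verify([3, 1, 3], [3, 1, 7, 2]))  # [3, 1, 7]

    # Illustrative sizes: a ~128k vocabulary with a quarter of the rows kept.
    full = lm_head_macs(128_000, 4096)
    reduced = lm_head_macs(32_000, 4096)
    print(1 - reduced / full)  # 0.75
```

Because the target model re-scores every drafted position, a draft restricted to frequent tokens can only lower the acceptance rate (for example, when the target wants a rare token); it cannot change which tokens the greedy output finally contains.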

Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun · 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Speculative Decoding | Spec-Bench | MT Score | 195.6 | 48
Question Answering | QA | Speedup Factor | 2.02 | 17
Language Model Decoding | Spec-Bench | Conv. Acc. | 234.2 | 11
Speculative Decoding Throughput | Spec-Bench | Throughput (Conv.) | 474 | 10
Decoding | Multi-task Evaluation Suite, Llama-3.2-1B (test) | MT Throughput (tokens/s) | 394.8 | 6
Speculative Decoding | Spec-Bench, OLMo 2 7B | Conversation Score | 4.72 | 5
Code Generation | Code | Throughput (tokens/s) | 183.5 | 3
Conversation | Conv. | Throughput (tokens/s) | 212.1 | 3
Machine Translation | MT | Throughput (tokens/s) | 188.7 | 3
Mathematical Reasoning | MATH | Throughput (tokens/s) | 238 | 3
Showing 10 of 12 rows
