FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
About
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code available at https://github.com/thunlp/FR-Spec.
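The vocabulary-compression idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the helper names (`build_frequent_subset`, `draft_logits`, `draft_argmax`), the toy corpus, and the choice to keep 25% of the vocabulary are all assumptions made for illustration. The sketch ranks tokens by corpus frequency, computes draft logits only over the retained LM-head rows, and maps the winning index back to a full-vocabulary token id. The target model's verification step over the full vocabulary, which is what preserves the final output distribution, is not shown.

```python
from collections import Counter


def build_frequent_subset(token_ids, keep):
    """Return the `keep` most frequent token ids, most frequent first."""
    counts = Counter(token_ids)
    # Sort by descending count; break ties by ascending id for determinism.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:keep]


def draft_logits(hidden, lm_head, subset):
    """Compute logits only for the LM-head rows listed in `subset`."""
    return [sum(h * w for h, w in zip(hidden, lm_head[t])) for t in subset]


def draft_argmax(hidden, lm_head, subset):
    """Pick the best draft token within the subset; return its full-vocab id."""
    logits = draft_logits(hidden, lm_head, subset)
    best = max(range(len(subset)), key=lambda i: logits[i])
    return subset[best]


# Toy setup (hypothetical): 8-token vocabulary, hidden size 2.
corpus = [3, 1, 3, 7, 1, 3, 5, 7, 3, 1]
lm_head = [[float(i), 1.0] for i in range(8)]

# Keep 2 of 8 tokens (25%), so the draft skips 75% of the LM-head rows.
subset = build_frequent_subset(corpus, keep=2)
hidden = [1.0, 0.5]
proposal = draft_argmax(hidden, lm_head, subset)
print(subset, proposal)
```

In this toy, the draft head does a dot product for 2 rows instead of 8. In a real system the saving would come from slicing the LM-head weight matrix once up front, so that each draft step multiplies against the smaller matrix.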
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Speculative Decoding | Spec-Bench | MT Score | 195.6 | 57 |
| Speculative Decoding | SpecBench | AVG SR | 778.8 | 47 |
| Question Answering | QA | Speedup Factor | 2.02 | 47 |
| Speculative Decoding | HumanEval | -- | -- | 36 |
| Speculative Decoding | Code | Throughput (tokens/s) | 123.6 | 22 |
| Speculative Decoding | Med | Throughput (tokens/s) | 114.1 | 22 |
| Speculative Decoding | Law | Throughput (tokens/s) | 114.6 | 22 |
| Speculative Decoding Inference | PubMedQA | Throughput (tokens/s) | 160.6 | 12 |
| Speculative Decoding Inference | Specialized Datasets Aggregate | Average Speed (tokens/s) | 152.7 | 12 |
| Speculative Decoding Inference | Pile of Law | Inference Speed (tokens/s) | 160.5 | 12 |