FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
About
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code available at https://github.com/thunlp/FR-Spec.
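The core idea above can be sketched in a few lines: rank tokens by corpus frequency, keep only the top fraction, and let the draft model score that subset instead of the full vocabulary, so the LM-head matmul shrinks proportionally (verification by the target model still runs over the full vocabulary, which is what preserves the output distribution). The snippet below is a minimal NumPy illustration of this vocabulary-compression step, not the released implementation; the helper names (`build_freq_subset`, `draft_logits`) and the toy sizes are hypothetical.

```python
import numpy as np

def build_freq_subset(token_counts, keep_ratio=0.25):
    # Hypothetical helper: ids of the most frequent tokens.
    k = max(1, int(len(token_counts) * keep_ratio))
    return np.argsort(token_counts)[::-1][:k]

def draft_logits(hidden, lm_head_weight, subset_ids):
    """Draft-model logits restricted to the frequency-ranked subset.

    hidden:         (d,)  hidden state from the single draft layer
    lm_head_weight: (V, d) full LM head weight matrix
    subset_ids:     ids of the retained high-frequency tokens
    """
    # Slicing the LM head rows drops the matmul cost from O(V*d)
    # to O(|subset|*d), i.e. ~75% less for a 25% subset.
    sub_w = lm_head_weight[subset_ids]   # (k, d)
    return sub_w @ hidden                # (k,)

# Toy example: 12-token vocabulary, 4-dim hidden states.
rng = np.random.default_rng(0)
V, d = 12, 4
W = rng.normal(size=(V, d))
h = rng.normal(size=d)
counts = rng.integers(1, 100, size=V)

subset = build_freq_subset(counts, keep_ratio=0.25)  # keep top 25%
logits = draft_logits(h, W, subset)
best = subset[int(np.argmax(logits))]  # map back to a full-vocab token id
```

Because the subset logits are an exact slice of the full logits, any draft token proposed this way still refers to a real vocabulary id, and the verify step can accept or reject it exactly as in standard speculative sampling.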
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speculative Decoding | Spec-Bench | MT Score | 195.6 | 48 |
| Question Answering | QA | Speedup Factor | 2.02 | 17 |
| Language Model Decoding | Spec-Bench | Conv. Acc | 234.2 | 11 |
| Speculative Decoding Throughput | Spec-Bench | Throughput (Conv.) | 474 | 10 |
| Decoding | Multi-task Evaluation Suite Llama-3.2-1B (test) | MT Throughput (tokens/s) | 394.8 | 6 |
| Speculative Decoding | Spec-Bench OLMo 2 7B | Conversation Score | 4.72 | 5 |
| Code Generation | Code | Throughput (tokens/s) | 183.5 | 3 |
| Conversation | Conv. | Throughput (tokens/s) | 212.1 | 3 |
| Machine Translation | MT | Throughput (tokens/s) | 188.7 | 3 |
| Mathematical Reasoning | MATH | Throughput (tokens/s) | 238 | 3 |