EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
About
Autoregressive decoding makes the inference of Large Language Models (LLMs) time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. First, autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. Second, the inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance. Based on these insights, we introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple yet highly efficient speculative sampling framework. By incorporating a token sequence advanced by one time step, EAGLE effectively resolves the uncertainty, enabling precise second-to-top-layer feature prediction with minimal overhead. We conducted comprehensive evaluations of EAGLE covering all models from the Vicuna and LLaMA2-Chat series and the MoE model Mixtral 8x7B Instruct, on tasks in dialogue, code generation, mathematical reasoning, and instruction following. For LLaMA2-Chat 70B, EAGLE achieved a latency speedup ratio of 2.7x-3.5x and doubled throughput while maintaining the distribution of the generated text.
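The claim that the output distribution is maintained rests on the verification step of speculative sampling. The sketch below shows the generic accept/resample rule for a single drafted token, as commonly described for speculative sampling in general; it is not taken from the EAGLE code. The function name `verify_draft_token` and the dictionary-based probability tables are hypothetical, chosen only for illustration.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng=random):
    """Accept or resample one drafted token (illustrative sketch).

    p_target, q_draft: dicts mapping token -> probability under the
    target model and the draft model (hypothetical representation).
    Returns (accepted, output_token).
    """
    p = p_target.get(token, 0.0)
    q = q_draft[token]  # the token was sampled from q_draft, so q > 0

    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return True, token

    # On rejection, resample from the normalized residual max(0, p - q).
    # A rejection implies p < q for this token, so the residual is non-zero.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return False, rng.choices(tokens, weights=weights, k=1)[0]
```

Under this rule the returned token follows the target distribution regardless of how good the draft is; a better draft, such as one built on feature-level prediction, only raises the acceptance rate and therefore the speedup.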
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Speed Up (x) | 1.5 | 246 |
| Arithmetic Reasoning | GSM8K | -- | -- | 173 |
| Mathematical Reasoning | GSM8K | Tau ($\tau$) | 4.01 | 97 |
| Multi-turn dialogue | MT-Bench | Kendall's Tau | 3.93 | 54 |
| Speculative Decoding | Spec-Bench | MT Score | 3.66 | 48 |
| Multi-turn conversation | MT-Bench | SR | 3.47 | 43 |
| Code Generation | HumanEval | Success Rate (SR) | 3.84 | 43 |
| Code Generation | MT-Bench (test) | Speedup Ratio | 3.934 | 26 |
| Machine Translation | WMT German-English 16 (test) | Speedup ratio | 2.496 | 26 |
| Question Answering | Natural Questions (test) | Speedup Ratio | 2.916 | 26 |