EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

About

Autoregressive decoding makes the inference of Large Language Models (LLMs) time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. First, autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. Second, the inherent uncertainty in feature-level autoregression constrains its performance. Based on these insights, we introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple yet highly efficient speculative sampling framework. By incorporating a token sequence advanced by one time step, EAGLE effectively resolves this uncertainty, enabling precise second-to-top-layer feature prediction with minimal overhead. We conducted comprehensive evaluations of EAGLE, covering all models from the Vicuna and LLaMA2-Chat series, the MoE model Mixtral 8x7B Instruct, and tasks in dialogue, code generation, mathematical reasoning, and instruction following. For LLaMA2-Chat 70B, EAGLE achieved a latency speedup of 2.7x-3.5x and doubled throughput while maintaining the distribution of the generated text.

Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang• 2024
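The draft-then-verify loop the abstract describes can be sketched in miniature. This is not the paper's implementation: the "target model" below is a toy recurrent feature transition, `draft_feature` is a hypothetical stand-in for EAGLE's feature-extrapolation head (here just a noisy copy of the true transition), and decoding is greedy rather than sampled so that acceptance is an exact token match. The sketch only illustrates the structure: the draft head extrapolates at the feature level, conditioned on the token advanced by one time step, and the target model verifies the proposed tokens in a single pass, so the output is identical to plain autoregressive decoding.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 11, 4
W_emb = rng.normal(size=(VOCAB, DIM))  # toy token embedding
W_out = rng.normal(size=(DIM, VOCAB))  # toy LM head, shared by draft and target

def target_feature(prev_feat, tok):
    """Toy stand-in for the target model's second-to-top-layer feature."""
    return np.tanh(prev_feat + W_emb[tok])

def draft_feature(prev_feat, tok):
    """Hypothetical EAGLE-style draft head: extrapolates the next feature from
    the previous feature plus the token advanced by one time step. Modeled here
    as a noisy copy of the true transition, just to exercise accept/reject."""
    return np.tanh(prev_feat + W_emb[tok]) + 0.05 * rng.normal(size=DIM)

def next_token(feat):
    return int(np.argmax(feat @ W_out))  # greedy, so the sketch is deterministic

def run_prefix(tokens):
    feat = np.zeros(DIM)
    for t in tokens:
        feat = target_feature(feat, t)
    return feat

def generate_baseline(prompt, n):
    """Plain autoregressive decoding: one target step per token."""
    out, feat = list(prompt), run_prefix(prompt)
    for _ in range(n):
        tok = next_token(feat)
        out.append(tok)
        feat = target_feature(feat, tok)
    return out

def generate_eagle_sketch(prompt, n, k=4):
    """Draft k tokens via feature-level extrapolation, then verify."""
    out, feat = list(prompt), run_prefix(prompt)
    while len(out) < len(prompt) + n:
        # Draft: extrapolate k steps ahead at the feature level.
        d_feat, proposal = feat, []
        for _ in range(k):
            tok = next_token(d_feat)
            proposal.append(tok)
            d_feat = draft_feature(d_feat, tok)
        # Verify with the true feature transition: keep the longest matching
        # prefix; at the first mismatch the target's own token is emitted.
        for tok in proposal:
            true_tok = next_token(feat)
            feat = target_feature(feat, true_tok)
            out.append(true_tok)
            if true_tok != tok:
                break
    return out[:len(prompt) + n]
```

Because verification always re-derives the target's greedy token, the sketch produces exactly the same sequence as `generate_baseline`; the draft head's quality only affects how many tokens are accepted per verification pass, which is where the speedup comes from.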

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Speed Up (x) | 1.5 | 246 |
| Arithmetic Reasoning | GSM8K | -- | -- | 173 |
| Mathematical Reasoning | GSM8K | Tau ($\tau$) | 4.01 | 97 |
| Multi-turn dialogue | MT-Bench | Kendall's Tau | 3.93 | 54 |
| Speculative Decoding | Spec-Bench | MT Score | 3.66 | 48 |
| Multi-turn conversation | MT-Bench | SR | 3.47 | 43 |
| Code Generation | HumanEval | Success Rate (SR) | 3.84 | 43 |
| Code Generation | MT-Bench (test) | Speedup Ratio | 3.934 | 26 |
| Machine Translation | WMT German-English 16 (test) | Speedup Ratio | 2.496 | 26 |
| Question Answering | Natural Questions (test) | Speedup Ratio | 2.916 | 26 |

Showing 10 of 25 rows.
