EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

About

Autoregressive decoding makes the inference of Large Language Models (LLMs) time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. Firstly, autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. Secondly, the inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance. Based on these insights, we introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple yet highly efficient speculative sampling framework. By incorporating a token sequence advanced by one time step, EAGLE effectively resolves the uncertainty, enabling precise second-to-top-layer feature prediction with minimal overhead. We conducted comprehensive evaluations of EAGLE, including all models from the Vicuna and LLaMA2-Chat series, the MoE model Mixtral 8x7B Instruct, and tasks in dialogue, code generation, mathematical reasoning, and instruction following. For LLaMA2-Chat 70B, EAGLE achieved a latency speedup ratio of 2.7x-3.5x and doubled throughput while maintaining the distribution of the generated text.
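The two ideas in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. The function names (`shifted_pairs`, `verify_draft_token`) and the list-based probability vectors are assumptions made for illustration. The first function shows one reading of "a token sequence advanced by one time step": each feature is paired with the token sampled after it. The second shows the standard speculative-sampling accept/reject rule, which is the usual way such methods keep the output distribution equal to that of the target model.

```python
import random


def shifted_pairs(features, tokens):
    """Pair each feature f_i with the token t_{i+1} that follows it.

    Hypothetical helper: `features` holds n per-position features and
    `tokens` holds n + 1 tokens, so the token sequence runs one step ahead.
    """
    if len(tokens) != len(features) + 1:
        raise ValueError("expected exactly one more token than features")
    return [(features[i], tokens[i + 1]) for i in range(len(features))]


def verify_draft_token(p_target, q_draft, token, rng):
    """Standard speculative-sampling check for a single drafted token.

    p_target and q_draft are probability vectors over the vocabulary.
    The drafted token is accepted with probability min(1, p/q); otherwise
    a replacement is drawn from the normalized residual max(0, p - q).
    Returns (final_token, accepted).
    """
    p, q = p_target[token], q_draft[token]
    if q > 0 and rng.random() < min(1.0, p / q):
        return token, True
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    # Fall back to the target distribution if the residual is empty
    # (only possible when the token could not have come from q_draft).
    weights = residual if sum(residual) > 0 else p_target
    replacement = rng.choices(range(len(p_target)), weights=weights)[0]
    return replacement, False


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.6, 0.3, 0.1]  # toy target-model distribution (assumed values)
    q = [0.2, 0.5, 0.3]  # toy draft-model distribution (assumed values)
    counts = [0, 0, 0]
    trials = 20000
    for _ in range(trials):
        drafted = rng.choices(range(3), weights=q)[0]
        final, _ = verify_draft_token(p, q, drafted, rng)
        counts[final] += 1
    # The empirical frequencies should be close to p, not q.
    print([round(c / trials, 3) for c in counts])
```

Running the script draws tokens from the draft distribution and passes them through the check. The printed frequencies should track the target distribution `p`, which is the sense in which verification preserves the distribution of the generated text.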

Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang • 2024

Related benchmarks

Task	Dataset	Metric	Result	
Arithmetic Reasoning	GSM8K	--	--	272
Mathematical Reasoning	GSM8K	Speed Up (x)	1.5	246
Code Generation	HumanEval	Speedup Factor	3.14	147
Mathematical Reasoning	GSM8K	Tau ($\tau$)	4.01	97
Inference Efficiency	HumanEval	Speedup Factor	5.12	90
LLM Inference Acceleration	GSM8K	Speedup	5.1	61
Mathematical Reasoning	GSM8K	Average Length	3.7629	61
Multimodal Understanding	MMT	Speedup Ratio	2.57	60
LLM Inference	Alpaca	Speedup	4.99	57
Speculative Decoding	Spec-Bench	MT Score	3.66	57

Showing 10 of 49 rows
