
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

About

Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods, such as EAGLE, use a static draft tree, implicitly assuming that the acceptance rate of draft tokens depends only on their position. Interestingly, we found that the acceptance rate of draft tokens is also context-dependent. In this paper, building upon EAGLE, we propose EAGLE-2, which introduces a context-aware dynamic draft tree into the drafting stage. This improvement leverages the fact that the draft model of EAGLE is well-calibrated: the confidence scores from the draft model approximate acceptance rates with small errors. We conducted extensive evaluations on three series of LLMs and six tasks, with EAGLE-2 achieving speedup ratios of 3.05x-4.26x, which is 20%-40% faster than EAGLE-1. EAGLE-2 also ensures that the distribution of the generated text remains unchanged, making it a lossless acceleration algorithm.
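The core idea described above can be illustrated with a small sketch: instead of a fixed draft-tree shape, the tree is grown dynamically, expanding the drafted paths whose cumulative draft-model confidence (the product of token probabilities along the path, which approximates the acceptance rate) is highest. This is only a minimal illustration of the technique, not the paper's implementation; `draft_probs`, the parameter names, and the toy budget values are assumptions for the example.

```python
import heapq

def expand_draft_tree(draft_probs, root_token, beam_width=4, depth=3, total_budget=8):
    """Sketch of confidence-guided dynamic draft-tree expansion.

    `draft_probs(prefix)` is a stand-in for the draft model: given a token
    prefix, it returns (token, probability) candidates for the next position.
    """
    # Each entry: (negative cumulative confidence, path of tokens).
    # Negation lets heapq.nsmallest pick the highest-confidence paths.
    frontier = [(-1.0, (root_token,))]
    drafted = []
    for _ in range(depth):
        candidates = []
        for neg_conf, path in frontier:
            for token, p in draft_probs(path)[:beam_width]:
                # Cumulative confidence (product of probabilities along the
                # path) serves as a proxy for the path's acceptance rate.
                candidates.append((neg_conf * p, path + (token,)))
        # Keep only the globally most confident paths — the tree's shape
        # therefore adapts to the context instead of being static.
        frontier = heapq.nsmallest(beam_width, candidates)
        drafted.extend(frontier)
    # Rerank all drafted paths and keep the best ones within the token budget.
    return [path for _, path in heapq.nsmallest(total_budget, drafted)]
```

In easy-to-predict contexts a single path accumulates high confidence and the tree grows deep; in uncertain contexts confidence mass spreads across siblings and the tree grows wide, which is the context dependence the abstract refers to.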

Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang • 2024

Related benchmarks

Task                        Dataset                  Metric           Result   Rank
Code Generation             HumanEval (test)         -                -        444
Text-to-Image Generation    GenEval                  GenEval Score    77.7     277
Text-to-Image Generation    DPG-Bench                DPG Score        82.8     89
Code Generation             HumanEval                Tokens/s         85.58    61
Inference Efficiency        HumanEval                Speedup Factor   3.24     54
Multi-turn Dialogue         MT-Bench                 Kendall's Tau    4.8      54
Mathematical Reasoning      GSM8K                    Tau ($\tau$)     4.98     54
Speculative Decoding        Spec-Bench               MT Score         176.8    48
Inference Acceleration      Spec-Bench               MAT Score        4.36     39
Text-to-Image Generation    MS-COCO 5K 2017 (val)   FID              32.4     34
(Showing 10 of 39 rows)
