EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
About
Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods, such as EAGLE, use a static draft tree, implicitly assuming that the acceptance rate of draft tokens depends only on their position. Interestingly, we find that the acceptance rate of draft tokens is also context-dependent. In this paper, building upon EAGLE, we propose EAGLE-2, which introduces a context-aware dynamic draft tree into the drafting stage. This improvement leverages the fact that the draft model of EAGLE is well-calibrated: its confidence scores approximate acceptance rates with small errors. We conducted extensive evaluations on three series of LLMs and six tasks, with EAGLE-2 achieving speedup ratios of 3.05x-4.26x, 20%-40% faster than EAGLE-1. EAGLE-2 also guarantees that the distribution of the generated text remains unchanged, making it a lossless acceleration algorithm.
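The core idea above can be illustrated with a small sketch: treat each draft token's confidence as an estimate of its acceptance rate, score every tree node by the product of confidences along its path, and grow the tree where that score is highest. This is a simplified best-first variant, not the paper's exact layer-wise expand-and-rerank procedure; `draft_topk` is a hypothetical stand-in for EAGLE's draft head.

```python
import heapq
import itertools


def build_dynamic_draft_tree(draft_topk, prefix, total_nodes=8, k=2):
    """Greedy sketch of a context-aware dynamic draft tree.

    draft_topk(path) returns the draft model's top candidate
    continuations of `path` as (token, confidence) pairs.  A node's
    value is the product of confidences along its path, which serves
    as a proxy for the probability that the whole branch is accepted.
    We repeatedly expand the unexpanded node with the highest value,
    so confident contexts get deep branches and uncertain ones stay
    shallow.  Returns a list of (path, value) draft nodes.
    """
    ids = itertools.count()                 # tie-breaker so the heap never compares paths
    nodes = []                              # drafted nodes: (path, value)
    heap = [(-1.0, next(ids), prefix)]      # max-heap on value via negation; root value = 1
    while heap and len(nodes) < total_nodes:
        neg_v, _, path = heapq.heappop(heap)
        for tok, conf in draft_topk(path)[:k]:
            child = path + (tok,)
            v = -neg_v * conf               # cumulative confidence along the branch
            nodes.append((child, v))
            heapq.heappush(heap, (-v, next(ids), child))
            if len(nodes) >= total_nodes:
                break
    return nodes


def toy_draft_topk(path):
    """Hypothetical draft head: one confident and one weak candidate."""
    return [(f"t{len(path)}a", 0.9), (f"t{len(path)}b", 0.3)]


tree = build_dynamic_draft_tree(toy_draft_topk, ("the",), total_nodes=6)
for path, value in tree:
    print(path, round(value, 3))
```

With the toy draft head, the confident branch is expanded to depth three while the low-confidence siblings are left unexpanded, which is exactly the budget-allocation behavior a static draft tree cannot provide.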
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval (test) | -- | -- | 444 |
| Text-to-Image Generation | GenEval | GenEval Score | 77.7 | 277 |
| Text-to-Image Generation | DPG-Bench | DPG Score | 82.8 | 89 |
| Code Generation | HumanEval | Tokens/s | 85.58 | 61 |
| Inference Efficiency | HumanEval | Speedup Factor | 3.24 | 54 |
| Multi-turn dialogue | MT-Bench | Kendall's Tau | 4.8 | 54 |
| Mathematical Reasoning | GSM8K | Tau ($\tau$) | 4.98 | 54 |
| Speculative Decoding | Spec-Bench | MT Score | 176.8 | 48 |
| Inference Acceleration | Spec-Bench | MAT Score | 4.36 | 39 |
| Text-to-Image Generation | MS-COCO 5K 2017 (val) | FID | 32.4 | 34 |