
ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

About

Speculative decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high-concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step efficiency. Extensive evaluations across diverse model scales, particularly the industrial-grade Qwen3-235B, demonstrate that ECHO consistently outperforms SOTA methods in both low-load and high-load scenarios, achieving up to a 5.35x wall-time speedup and a relative speedup gain of over 20%.
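To make the idea of treating the batch as one budgeted "super-tree" more concrete, here is a minimal toy sketch in Python. It is not ECHO's algorithm or its SGLang integration: the function name `allocate_draft_budget`, the `trees` layout, the `gate` threshold, and the best-first selection rule are all assumptions made for illustration. The sketch assumes each draft node carries a cumulative path confidence that never exceeds its parent's, and it spends a single shared verification budget on the most confident nodes across all requests, so confident requests grow deeper while uncertain ones are cut off early.

```python
import heapq
from collections import defaultdict


def allocate_draft_budget(trees, budget, gate=0.1):
    """Pick draft nodes to verify across a whole batch (toy illustration).

    trees:  {request_id: {node_id: (parent_id or None, path_confidence)}}
            path_confidence is assumed to be cumulative, so a child's value
            never exceeds its parent's.
    budget: total number of draft nodes the batch may send to verification.
    gate:   hypothetical confidence threshold below which nodes are dropped.
    """
    children = defaultdict(list)
    frontier = []  # max-heap on confidence (stored negated for heapq)
    for req, nodes in trees.items():
        for node, (parent, conf) in nodes.items():
            if parent is None:
                heapq.heappush(frontier, (-conf, req, node))
            else:
                children[(req, parent)].append((node, conf))

    selected = defaultdict(list)
    while frontier and budget > 0:
        neg_conf, req, node = heapq.heappop(frontier)
        if -neg_conf < gate:
            break  # the best remaining candidate is already below the gate
        selected[req].append(node)
        budget -= 1
        # A node only becomes eligible once its parent has been selected,
        # which keeps every selected set a valid (connected) subtree.
        for child, conf in children[(req, node)]:
            heapq.heappush(frontier, (-conf, req, child))
    return dict(selected)


if __name__ == "__main__":
    batch = {
        "a": {0: (None, 0.9), 1: (0, 0.8), 2: (1, 0.7), 3: (None, 0.05)},
        "b": {0: (None, 0.3), 1: (0, 0.05)},
    }
    print(allocate_draft_budget(batch, budget=4, gate=0.1))
```

In this toy batch, request "a" has a confident chain and receives three of the four budget slots, while request "b" receives one; raising the budget further changes nothing because every remaining node falls below the gate.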

Xinyi Hu, Yuhao Shen, Baolin Zhang, Hengxin Zhang, Jun Dai, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan • 2026

Related benchmarks

Task                   Dataset    Result             Rank
Instruction Following  Alpaca     Speedup (x): 4.13  173
Summarization          CNN/DM     MAT Score: 6.22    30
Multi-turn dialogue    MT-Bench   MAT Score: 6.35    30
Code Generation        HumanEval  MAT Score: 8.35    26
