TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees

About

Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent work has shifted from sequential chain-based drafting to tree-structured generation, where the draft model constructs a tree of candidate tokens to explore multiple possible drafts in parallel. However, existing tree-based SD methods typically build a fixed-width, fixed-depth draft tree, which fails to adapt to the varying difficulty of tokens and contexts. As a result, the draft model cannot adjust the tree structure dynamically to stop early on difficult tokens and extend generation for easy ones. To address these challenges, we introduce TALON, a training-free, budget-driven adaptive tree expansion framework that can be plugged into existing tree-based methods. Unlike static methods, TALON constructs the draft tree iteratively until a fixed token budget is met, using a hybrid expansion strategy that adaptively allocates the node budget to each layer of the draft tree. This framework naturally shapes the draft tree into a "deep-and-narrow" form for deterministic contexts and a "shallow-and-wide" form for uncertain branches, effectively optimizing the trade-off between exploration width and generation depth under a given budget. Extensive experiments across 5 models and 6 datasets demonstrate that TALON consistently outperforms the state-of-the-art EAGLE-3, achieving up to 5.16x end-to-end speedup over autoregressive decoding.
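To make the budget-driven idea concrete, here is a minimal sketch of how a fixed node budget can shape a draft tree differently depending on draft confidence. It is not TALON's hybrid expansion strategy, which the abstract does not specify; it substitutes a plain best-first expansion ordered by cumulative path probability. The names `build_tree`, `peaked`, and `flat` are hypothetical, and the two toy draft distributions stand in for a real draft model.

```python
import heapq
from typing import Callable, List, Tuple

Path = Tuple[str, ...]
Dist = List[Tuple[str, float]]  # (token, probability) pairs from a draft model


def build_tree(draft: Callable[[Path], Dist], budget: int) -> List[Tuple[Path, float]]:
    """Best-first draft-tree expansion under a fixed node budget (illustrative only).

    Repeatedly adds the candidate with the highest cumulative path
    probability until `budget` nodes have been added to the tree.
    """
    nodes: List[Tuple[Path, float]] = []
    # heapq is a min-heap, so scores are stored negated.
    frontier = [(-p, (tok,)) for tok, p in draft(())]
    heapq.heapify(frontier)
    while frontier and len(nodes) < budget:
        neg_score, path = heapq.heappop(frontier)
        nodes.append((path, -neg_score))
        # Children inherit the parent's cumulative probability times their own.
        for tok, p in draft(path):
            heapq.heappush(frontier, (neg_score * p, path + (tok,)))
    return nodes


def peaked(path: Path) -> Dist:
    """Toy 'deterministic' context: one token dominates at every step."""
    return [("a", 0.95), ("b", 0.05)]


def flat(path: Path) -> Dist:
    """Toy 'uncertain' context: four equally likely tokens at every step."""
    return [("a", 0.25), ("b", 0.25), ("c", 0.25), ("d", 0.25)]


def max_depth(nodes: List[Tuple[Path, float]]) -> int:
    """Depth of the deepest node in the tree."""
    return max(len(path) for path, _ in nodes)
```

With a budget of 6 nodes, the peaked toy distribution yields a single chain of depth 6, while the flat one spends four nodes on the first layer and reaches only depth 2. This mirrors the "deep-and-narrow" versus "shallow-and-wide" behavior described above, under the stated simplifications.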

Tianyu Liu, Qitan Lv, Yuhao Shen, Xiao Sun, Xiaoyan Sun • 2026

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	GSM8K	Speed Up (x)	4.43	246
Instruction Following	Alpaca	Speedup (x)	4.04	173
Question Answering	QA	Speedup Factor	3.27	47
Summarization	CNN/DM	Speedup	3.58	32
Code Generation	HumanEval	MAT	9.48	14
Multi-turn conversation	MT-Bench	MAT Score	7.29	14

