TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

About

Using a diffusion model for parallel drafting is a promising approach for speculative decoding. By predicting tokens at multiple future positions in a single forward pass, diffusion drafters substantially reduce drafting latency. However, this shifts the bottleneck to verification: verifying a single sequence limits acceptance length, while verifying large draft trees incurs excessive target-model latency. We identify a key mismatch in existing diffusion-tree methods: they rank nodes by marginal probability, ignoring that verification is prefix-conditioned. As a result, they may verify unreachable descendants of rejected prefixes, increasing latency with limited acceptance gains. To address this, we propose TAPS, a target-aware prefix selection method that turns diffusion marginals into path-conditioned acceptance estimates. TAPS then selects a compact prefix-closed subtree under a fixed verification budget, improving the acceptance-cost tradeoff rather than simply expanding the draft tree. Experiments across diverse datasets and model families demonstrate that TAPS achieves up to 7.9x lossless end-to-end speedup over vanilla autoregressive decoding, outperforming the state-of-the-art DFlash and DDTree by 1.36x and 1.74x, respectively. Our work is available at https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD
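
To make the idea of prefix-conditioned selection under a fixed budget concrete, here is a minimal sketch. It is not the TAPS estimator, which the abstract does not specify. As a stand-in assumption, a node's score is the product of the drafter's per-position marginals along its path, and nodes are picked best-first until the budget is reached. The function name `select_prefix_closed_subtree` and the `marginals` layout are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Path = Tuple[int, ...]  # token ids from the first drafted position to a node


def select_prefix_closed_subtree(
    marginals: List[Dict[int, float]],
    budget: int,
) -> List[Tuple[Path, float]]:
    """Pick up to `budget` draft-tree nodes, best-first by path score.

    marginals[d] maps a candidate token id to the drafter's marginal
    probability at future position d (hypothetical input layout).
    ASSUMPTION: path score = product of marginals along the path. This is
    an illustrative proxy, not the estimator used by TAPS.
    """
    if budget <= 0 or not marginals:
        return []
    # heapq is a min-heap, so scores are stored negated.
    heap = [(-p, (tok,)) for tok, p in marginals[0].items()]
    heapq.heapify(heap)
    selected: List[Tuple[Path, float]] = []
    while heap and len(selected) < budget:
        neg_score, path = heapq.heappop(heap)
        selected.append((path, -neg_score))
        depth = len(path)
        if depth < len(marginals):
            # Children enter the heap only after their parent is selected,
            # so the selected set is always prefix-closed.
            for tok, p in marginals[depth].items():
                heapq.heappush(heap, (neg_score * p, path + (tok,)))
    return selected


# Toy input: three future positions, two candidate tokens each.
toy = [{1: 0.9, 2: 0.1}, {3: 0.6, 4: 0.4}, {5: 0.95, 6: 0.05}]
chosen = select_prefix_closed_subtree(toy, budget=4)
print([path for path, _ in chosen])
# paths selected: (1,), (1, 3), (1, 3, 5), (1, 4)
```

Because a child's score can never exceed its parent's under this proxy, the budget is spent only on nodes whose whole prefix is also verified. Ranking by marginals alone would not guarantee this: in the toy input, token 5 has a 0.95 marginal at the third position regardless of which prefix precedes it.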

Zhuoyu Wang, Junnan Huang, Xinyu Chen • 2026

Related benchmarks

Task	Dataset	Result
Speculative Decoding	GSM8K	Average Generation Length (τ): 8.26	109
Speculative Decoding	MT-Bench	Tau (τ): 5.36	81
Speculative Decoding	LiveCodeBench	Speedup Factor: 7.16	66
Speculative Decoding	HumanEval	Speedup Factor: 6.99	52
Speculative Decoding	MBPP	Speedup: 6.75	52
Speculative Decoding	MATH 500	Speedup: 7.9	52
Speculative Decoding	AVG	Speedup: 6.73	32
Speculative Decoding	AIME 25	Speedup: 7.08	26
