Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

About

Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled from position-wise marginals rather than fully conditioned sequences, committing to a single greedy path often fails to capture the target model's preferred trajectory. To address this, we propose BASTION, a budget-aware speculative decoding framework with tree-based diffusion drafting. Unlike existing methods that rely on static tree topologies, BASTION dynamically constructs query-dependent trees by balancing draft quality against hardware constraints. Our framework integrates three synergistic components: (1) an acceptance surrogate that estimates expected accepted length via path confidence, (2) an online latency estimator that calibrates a hardware-aware roofline model, and (3) an adaptive best-first expansion that grows the tree until marginal gains no longer justify incremental verification costs. BASTION is training-free, preserves the target model's distribution, and requires no per-setting tuning. Across diverse benchmarks and GPU architectures, BASTION achieves up to a 6.61x speedup over standard autoregressive decoding, outperforming state-of-the-art block-diffusion baselines by 39%.
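The budget-aware expansion described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that path confidence is the product of position-wise marginal probabilities, that expected accepted length is one plus the sum of path confidences over tree nodes, and that verification latency follows a simple roofline shape (flat while memory-bound, then linear in the number of verified tokens). The names `roofline_latency` and `build_draft_tree`, and the parameters `t_mem`, `t_per_token`, and `max_nodes`, are hypothetical.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[int, float]  # (token id, marginal probability at that position)


def roofline_latency(num_tokens: int, t_mem: float = 1.0,
                     t_per_token: float = 0.05) -> float:
    """Hypothetical roofline-style cost of verifying `num_tokens` tokens:
    constant while memory-bound, linear once compute-bound."""
    return max(t_mem, t_per_token * num_tokens)


def build_draft_tree(
    marginals: Sequence[Sequence[Candidate]],
    latency: Callable[[int], float] = roofline_latency,
    max_nodes: int = 64,
) -> Tuple[List[Tuple[int, ...]], float]:
    """Best-first tree growth under a latency model (illustrative only).

    marginals[d] lists the candidate tokens for draft position d.
    Returns the selected paths and the surrogate expected accepted length.
    """
    tree: List[Tuple[int, ...]] = []
    expected_len = 1.0  # assumption: the target model always yields one token
    # Max-heap on path confidence, stored as negative values for heapq.
    heap = [(-p, (tok,)) for tok, p in marginals[0]] if marginals else []
    heapq.heapify(heap)

    while heap and len(tree) < max_nodes:
        neg_conf, path = heap[0]          # peek at the most confident frontier node
        conf = -neg_conf
        n = len(tree) + 1                 # tokens verified now: root + selected nodes
        current = expected_len / latency(n)
        candidate = (expected_len + conf) / latency(n + 1)
        if candidate <= current:          # marginal gain no longer pays for its cost
            break
        heapq.heappop(heap)
        tree.append(path)
        expected_len += conf
        depth = len(path)
        if depth < len(marginals):        # expand children from the next position
            for tok, p in marginals[depth]:
                heapq.heappush(heap, (-conf * p, path + (tok,)))
    return tree, expected_len
```

Because a child's confidence never exceeds its parent's, nodes are selected in non-increasing confidence order, so the loop can stop at the first node whose gain does not cover the extra verification cost under the assumed latency model. Passing a steeper latency function (for example `lambda n: max(1.0, 0.5 * n)`) yields a smaller tree, while a flatter one lets the tree grow.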

Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung, Huzama Ahmad, Sangmin Bae, Se-Young Yun • 2026

Related benchmarks

Task	Dataset	Result
Instruction Following	Alpaca	Speedup (x) 3.56	173
Code Generation	MBPP	Average Acceptance Length (τ) 8.99	95
Code Generation	LCB	Speedup 7.81	75
Chat	MT-Bench	--	73
General Evaluation	Average Across all Benchmarks	Speedup 6.9	28
Math	GSM8K	Speedup 8.08	14
Math	MATH500	Speedup 8.5	14
Chat	Alpaca	Speedup 3.59	12
Code	MBPP	Speedup 7.68	12
Code	LCB	Speedup 8.37	12

Showing 10 of 12 rows
