Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

About

Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled from position-wise marginals rather than fully conditioned sequences, committing to a single greedy path often fails to capture the target model's preferred trajectory. To address this, we propose BASTION, a budget-aware speculative decoding framework with tree-based diffusion drafting. Unlike existing methods that rely on static tree topologies, BASTION dynamically constructs query-dependent trees by balancing draft quality against hardware constraints. Our framework integrates three synergistic components: (1) an acceptance surrogate that estimates expected accepted length via path confidence, (2) an online latency estimator that calibrates a hardware-aware roofline model, and (3) an adaptive best-first expansion that grows the tree until marginal gains no longer justify incremental verification costs. BASTION is training-free, preserves the target model's distribution, and requires no per-setting tuning. Across diverse benchmarks and GPU architectures, BASTION achieves up to a 6.61x speedup over standard autoregressive decoding, outperforming state-of-the-art block-diffusion baselines by 39%.
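The adaptive best-first expansion described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`roofline_latency`, `build_tree`), the default constants, the use of summed path confidences as the acceptance surrogate, and the tokens-per-unit-time stopping rule are all assumptions chosen for illustration. Path confidence is taken to be the product of position-wise marginal probabilities along a path, and the latency model is a toy roofline that stays flat while verification is memory-bound and grows linearly once compute dominates.

```python
import heapq

def roofline_latency(n_tokens, mem_time=1.0, per_token_compute=0.05):
    """Hypothetical roofline: flat (memory-bound) cost until compute dominates."""
    return max(mem_time, n_tokens * per_token_compute)

def build_tree(marginals, top_k=2, draft_time=0.2, latency=roofline_latency):
    """Grow a draft tree best-first from position-wise marginals.

    marginals[d] maps each candidate token at draft position d to its probability.
    Returns (paths, expected): the selected tree nodes (each a tuple of tokens
    from the root) and the surrogate expected tokens gained per verification.
    """
    def top(d):
        # Keep only the top_k most probable tokens at position d.
        return sorted(marginals[d].items(), key=lambda kv: -kv[1])[:top_k]

    # Max-heap on path confidence (negated, since heapq is a min-heap).
    heap = [(-p, (tok,)) for tok, p in top(0)]
    heapq.heapify(heap)
    paths = []
    expected = 1.0  # assumption: the target model always contributes one token
    while heap:
        neg_conf, path = heapq.heappop(heap)
        conf = -neg_conf
        n = 1 + len(paths)  # tokens verified so far: root plus selected nodes
        rate_now = expected / (draft_time + latency(n))
        rate_new = (expected + conf) / (draft_time + latency(n + 1))
        if rate_new <= rate_now:
            break  # marginal gain no longer justifies the extra verification cost
        paths.append(path)
        expected += conf
        d = len(path)
        if d < len(marginals):
            for tok, p in top(d):
                heapq.heappush(heap, (-conf * p, path + (tok,)))
    return paths, expected

# Toy usage with made-up marginals for two draft positions.
marginals = [{"the": 0.7, "a": 0.3}, {"cat": 0.6, "dog": 0.4}]
paths, expected = build_tree(marginals)
print(len(paths), round(expected, 3))
```

Because a child's confidence never exceeds its parent's, popping candidates in decreasing confidence order keeps the tree prefix-closed, and stopping at the first candidate that fails to improve the rate is a reasonable greedy rule under these assumptions. An online estimator, as the abstract describes, would replace the fixed `mem_time` and `per_token_compute` constants with values calibrated from measured verification times.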

Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung, Huzama Ahmad, Sangmin Bae, Se-Young Yun • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction Following | Alpaca | Speedup (x): 3.56 | 173 |
| Chat | MT-Bench | -- | 73 |
| Code Generation | MBPP | Average Acceptance Length (τ): 8.99 | 37 |
| Code Generation | LCB | Speedup: 7.81 | 33 |
| General Evaluation | Average Across all Benchmarks | Speedup: 6.9 | 28 |
| Chat | Alpaca | Speedup: 3.59 | 12 |
| Code | MBPP | Speedup: 7.68 | 12 |
| Code | LCB | Speedup: 8.37 | 12 |
| Math | GSM8K | Speedup: 8.08 | 12 |
| Math | MATH500 | Speedup: 8.5 | 12 |
Showing 10 of 12 rows
