BASTION: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting
About
Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled from position-wise marginals rather than fully conditioned sequences, committing to a single greedy path often fails to capture the target model's preferred trajectory. To address this, we propose BASTION, a budget-aware speculative decoding framework with tree-based diffusion drafting. Unlike existing methods that rely on static tree topologies, BASTION dynamically constructs query-dependent trees by balancing draft quality against hardware constraints. Our framework integrates three synergistic components: (1) an acceptance surrogate that estimates expected accepted length via path confidence, (2) an online latency estimator that calibrates a hardware-aware roofline model, and (3) an adaptive best-first expansion that grows the tree until marginal gains no longer justify incremental verification costs. BASTION is training-free, preserves the target model's distribution, and requires no per-setting tuning. Across diverse benchmarks and GPU architectures, BASTION achieves up to a 6.61x speedup over standard autoregressive decoding, outperforming state-of-the-art block-diffusion baselines by 39%.
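To make the interplay of the three components concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the drafter exposes per-position marginals as sorted `(token, prob)` lists, uses the product of marginals along a path as the acceptance surrogate, and models verification latency with a simplified two-parameter roofline (a memory-bound floor `t_mem` and a compute-bound slope `t_per_token`). The names `roofline_latency`, `build_tree`, `top_k`, and `max_nodes` are hypothetical, and the stopping rule (expand while the estimated tokens-per-second rate improves) is one plausible reading of "marginal gains no longer justify incremental verification costs". Online calibration of the latency model and the verification step itself are omitted.

```python
import heapq


def roofline_latency(n_tokens, t_mem, t_per_token):
    """Simplified roofline: latency is flat while memory-bound,
    then grows linearly with the number of verified tokens."""
    return max(t_mem, n_tokens * t_per_token)


def build_tree(marginals, latency_fn, top_k=2, max_nodes=64):
    """Best-first tree growth (illustrative sketch).

    marginals: list over draft positions; each entry is a list of
               (token, prob) pairs sorted by descending prob.
    latency_fn: maps number of verified tokens -> estimated latency.
    Returns (tree, expected_len), where tree is a list of
    (position, token, path_conf, parent_index) tuples.
    """
    tree = []
    expected_len = 1.0  # the target model always yields one token per step
    heap = []           # max-heap on path confidence via negated keys
    counter = 0         # tie-breaker so heap entries stay comparable

    for tok, p in marginals[0][:top_k]:
        heapq.heappush(heap, (-p, counter, 0, tok, -1))
        counter += 1

    while heap and len(tree) < max_nodes:
        neg_conf, _, pos, tok, parent = heapq.heappop(heap)
        conf = -neg_conf
        n = len(tree)
        # Verified tokens = tree nodes plus the root token.
        cur_rate = expected_len / latency_fn(n + 1)
        new_rate = (expected_len + conf) / latency_fn(n + 2)
        if new_rate <= cur_rate:
            break  # marginal gain no longer pays for the added cost

        tree.append((pos, tok, conf, parent))
        expected_len += conf
        idx = len(tree) - 1

        # Children inherit the parent's path confidence times their marginal.
        if pos + 1 < len(marginals):
            for t, p in marginals[pos + 1][:top_k]:
                heapq.heappush(heap, (-(conf * p), counter, pos + 1, t, idx))
                counter += 1

    return tree, expected_len


# Toy usage: two draft positions, latency flat up to 4 verified tokens.
toy_marginals = [[(1, 0.9), (2, 0.1)], [(3, 0.8), (4, 0.2)]]
toy_latency = lambda n: roofline_latency(n, t_mem=1.0, t_per_token=0.25)
toy_tree, toy_expected = build_tree(toy_marginals, toy_latency)
print([node[1] for node in toy_tree])  # tokens kept in the tree: [1, 3, 4]
```

In the toy run, the low-confidence sibling at position 0 (token 2) is rejected because adding a fifth verified token pushes the estimated latency past the memory-bound floor while contributing only 0.1 expected tokens. Because candidates are popped in descending path confidence, stopping at the first non-improving node is a greedy heuristic rather than a guaranteed optimum.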
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Instruction Following | Alpaca | Speedup (x) | 3.56 | 173 |
| Chat | MT-Bench | -- | -- | 73 |
| Code Generation | MBPP | Average Acceptance Length (τ) | 8.99 | 37 |
| Code Generation | LCB | Speedup (x) | 7.81 | 33 |
| General Evaluation | Average Across all Benchmarks | Speedup (x) | 6.9 | 28 |
| Chat | Alpaca | Speedup (x) | 3.59 | 12 |
| Code | MBPP | Speedup (x) | 7.68 | 12 |
| Code | LCB | Speedup (x) | 8.37 | 12 |
| Math | GSM8K | Speedup (x) | 8.08 | 12 |
| Math | MATH500 | Speedup (x) | 8.5 | 12 |