DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference
About
Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, widely adopted model-based draft designs such as EAGLE3 improve accuracy at the cost of multi-step autoregressive inference. This results in high drafting latency and ultimately makes the drafting stage itself a performance bottleneck. Inspired by diffusion-based large language models (dLLMs), we propose DART, which leverages parallel generation to reduce drafting latency. DART predicts logits for multiple future masked positions in parallel within a single forward pass based on hidden states of the target model, thereby eliminating autoregressive rollouts in the draft model while preserving a lightweight design. Based on these parallel logit predictions, we further introduce an efficient tree pruning algorithm that constructs high-quality draft token trees with N-gram-enforced semantic continuity. DART substantially reduces draft-stage overhead while preserving high draft accuracy, leading to significantly improved end-to-end decoding speed. Experimental results demonstrate that DART achieves a wall-clock speedup of 2.03x to 3.44x across multiple datasets, surpassing EAGLE3 by 30% on average and offering a practical speculative decoding framework. Code is released at https://github.com/fvliang/DART.
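The abstract combines two ideas: predicting logits for several future masked positions in one forward pass, and pruning the resulting candidates into a draft tree whose paths respect N-gram continuity. The Python sketch below illustrates that pipeline under stated assumptions. It is not DART's algorithm; the released code linked above is the authoritative reference. Assumptions: per-position log-probabilities arrive as plain dicts; candidates are the top-k tokens at each position; the N-gram constraint is simplified to a hypothetical set of allowed bigrams; the tree is pruned by beam search on summed log-probability; and verification is greedy, which is lossless only for greedy decoding. The names `build_draft_tree`, `verify_greedy`, and `allowed_bigrams` are hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple

Token = int
Path = Tuple[Token, ...]


def build_draft_tree(
    root_token: Token,
    position_logprobs: List[Dict[Token, float]],
    allowed_bigrams: Set[Tuple[Token, Token]],
    top_k: int = 3,
    beam_width: int = 4,
) -> List[Path]:
    """Return every draft path (tree node) that survives pruning.

    position_logprobs[d] holds draft log-probabilities for the d-th future
    masked position; in this sketch all depths are assumed to come from one
    parallel forward pass, so no autoregressive draft rollout is needed.
    """
    beams: List[Tuple[float, Path]] = [(0.0, ())]
    nodes: List[Path] = []
    for dist in position_logprobs:
        # Candidate tokens at this depth: the top_k most likely for this position.
        candidates = heapq.nlargest(top_k, dist.items(), key=lambda kv: kv[1])
        expanded: List[Tuple[float, Path]] = []
        for score, path in beams:
            prev = path[-1] if path else root_token
            for tok, logprob in candidates:
                # Bigram check (hypothetical stand-in for N-gram continuity):
                # skip tokens never observed after `prev`.
                if (prev, tok) in allowed_bigrams:
                    expanded.append((score + logprob, path + (tok,)))
        if not expanded:
            break
        # Keep only the highest-scoring prefixes; each extends a kept parent,
        # so the surviving paths always form a tree.
        beams = heapq.nlargest(beam_width, expanded, key=lambda sp: sp[0])
        nodes.extend(path for _, path in beams)
    return nodes


def verify_greedy(
    nodes: List[Path], target_next: Callable[[Path], Token]
) -> Tuple[Path, Token]:
    """Accept the longest tree path matching the target's greedy choices.

    A real system scores all nodes in one target forward pass with a tree
    attention mask; a per-prefix callback is used here for clarity.
    Returns the accepted draft tokens plus the target's own next token.
    """
    node_set = set(nodes)
    accepted: Path = ()
    while True:
        token = target_next(accepted)
        if accepted + (token,) not in node_set:
            return accepted, token
        accepted = accepted + (token,)


# Toy example: three masked positions after the last verified token 10.
position_logprobs = [
    {1: -0.1, 2: -1.0, 3: -2.0},
    {4: -0.2, 5: -0.9, 6: -3.0},
    {7: -0.3, 8: -1.2},
]
allowed_bigrams = {(10, 1), (10, 2), (1, 4), (1, 5), (2, 6), (4, 7), (5, 8)}
nodes = build_draft_tree(10, position_logprobs, allowed_bigrams)
target_choices = {(): 1, (1,): 5, (1, 5): 9}
accepted, bonus = verify_greedy(nodes, lambda p: target_choices.get(p, 0))
print(nodes)
print(accepted, bonus)
```

In the toy run, token 3 never enters the tree because (10, 3) is not an allowed bigram, and the target accepts the draft tokens (1, 5) and then supplies 9 itself, yielding three tokens from a single verification step. Sampling-based decoding would replace `verify_greedy` with speculative rejection sampling to remain lossless.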
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Inference Efficiency | HumanEval | Speedup factor | 3.25x | 54 |
| Code Generation | CodeAlpaca | Average speedup | 3.45x | 41 |
| Generative Inference | MT-Bench | Speedup | 2.73x | 26 |
| LLM Inference | Alpaca | Speedup | 2.95x | 21 |
| LLM Inference | LiveCodeBench | Speedup | 2.81x | 21 |
| LLM Inference | MATH500 | Speedup | 2.84x | 21 |
| LLM Inference | MBPP | Speedup | 3.09x | 21 |
| LLM Inference | Mean over Alpaca, CodeAlpaca, HumanEval, LiveCodeBench, MATH500, MBPP, MT-Bench | Mean speedup | 2.87x | 21 |
| LLM Inference | Alpaca, CodeAlpaca, HumanEval, LiveCodeBench, MATH500, MBPP, MT-Bench | Speedup (Alpaca) | 2.61x | 8 |