
DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference

About

Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference, resulting in high drafting latency and ultimately rendering the drafting stage itself a performance bottleneck. Inspired by diffusion-based large language models (dLLMs), we propose DART, which leverages parallel generation to reduce drafting latency. DART predicts logits for multiple future masked positions in parallel within a single forward pass based on hidden states of the target model, thereby eliminating autoregressive rollouts in the draft model while preserving a lightweight design. Based on these parallel logit predictions, we further introduce an efficient tree pruning algorithm that constructs high-quality draft token trees with N-gram-enforced semantic continuity. DART substantially reduces draft-stage overhead while preserving high draft accuracy, leading to significantly improved end-to-end decoding speed. Experimental results demonstrate that DART achieves a 2.03x to 3.44x wall-clock time speedup across multiple datasets, surpassing EAGLE3 by 30% on average and offering a practical speculative decoding framework. Code is released at https://github.com/fvliang/DART.
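
The drafting idea described above can be illustrated with a small toy sketch. It is not the authors' implementation: the per-position distributions, the bigram allow-list, and the function names `build_draft_paths` and `accepted_length` are all hypothetical, and beam-style pruning with a bigram filter is used here only as a simple stand-in for the N-gram-enforced tree pruning the abstract mentions. The sketch assumes the draft model has already produced one probability distribution per future position in a single pass, and that verification is greedy (a draft token is accepted only if it equals the target model's own choice at that position).

```python
import math
from typing import Dict, List, Set, Tuple

Dist = Dict[str, float]  # hypothetical: token -> probability for one future position


def build_draft_paths(
    position_dists: List[Dist],
    bigram_ok: Set[Tuple[str, str]],  # assumed allow-list of (previous, next) pairs
    prev_token: str,
    top_k: int = 2,
    beam: int = 3,
) -> List[Tuple[List[str], float]]:
    """Combine independent per-position predictions into a few candidate paths.

    Each position contributes its top_k tokens. A continuation is dropped when
    its (previous, next) pair is not in bigram_ok, and only the `beam` paths
    with the highest summed log-probability survive each step.
    """
    paths: List[Tuple[List[str], float]] = [([], 0.0)]
    for dist in position_dists:
        top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        candidates = []
        for tokens, score in paths:
            last = tokens[-1] if tokens else prev_token
            for tok, p in top:
                if (last, tok) not in bigram_ok:
                    continue  # continuity filter: skip implausible pairs
                candidates.append((tokens + [tok], score + math.log(p)))
        if not candidates:
            break  # nothing survives the filter; keep the shorter paths
        candidates.sort(key=lambda c: c[1], reverse=True)
        paths = candidates[:beam]
    return paths


def accepted_length(draft: List[str], target_greedy: List[str]) -> int:
    """Greedy verification: count leading draft tokens that match the target."""
    n = 0
    for d, t in zip(draft, target_greedy):
        if d != t:
            break
        n += 1
    return n


# Toy inputs (made up for illustration only).
position_dists = [
    {"the": 0.6, "a": 0.4},
    {"cat": 0.5, "the": 0.3, "dog": 0.2},
    {"sat": 0.7, "ran": 0.3},
]
bigram_ok = {
    ("<s>", "the"), ("<s>", "a"),
    ("the", "cat"), ("a", "cat"),
    ("cat", "sat"), ("cat", "ran"),
}
paths = build_draft_paths(position_dists, bigram_ok, prev_token="<s>")
for tokens, score in paths:
    print(tokens, round(math.exp(score), 3))
```

Paths that share a prefix (here, "the cat sat" and "the cat ran") would form branches of one draft tree, so the target model could check them together in a single verification pass. The filter matters because positions predicted in parallel do not condition on each other, which is why a pair such as ("the", "the") can appear among the top tokens even though it is an unlikely continuation.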

Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, Chen Tian • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Instruction Following | MT-Bench | -- | 287 |
| Instruction Following | Alpaca | Speedup (x): 2.21 | 173 |
| Inference Efficiency | HumanEval | Speedup Factor: 3.25 | 90 |
| Multi-turn conversation | MT-Bench | Speedup: 2.27 | 76 |
| LLM Inference | Alpaca | Speedup: 2.95 | 57 |
| Generative Inference | MT-Bench | Speedup: 2.73 | 44 |
| Code Generation | CodeAlpaca | Average Speed-up: 3.45 | 41 |
| Code Generation | MBPP | Average Acceptance Length (τ): 3.01 | 37 |
| Code Generation | LCB | Speedup: 2.25 | 33 |
| LLM Inference | LiveCodeBench | Speedup: 2.81 | 21 |
Showing 10 of 19 rows
