SparseD: Sparse Attention for Diffusion Language Models

About

While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck stems mainly from attention's quadratic complexity in context length, since every query-key pair is computed. A natural strategy for reducing this complexity is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well established in ARs, where attention follows fixed and clearly defined sparse patterns. In DLMs, however, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging these observations, SparseD pre-computes head-specific sparse patterns only once and reuses them across all subsequent steps, which avoids recomputing sparse patterns at each denoising step. Meanwhile, SparseD uses full attention in the early steps and switches to sparse attention later to maintain generation quality. Together, these establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to $1.50\times$ speedup over FlashAttention at a 64k context length with 1,024 denoising steps.
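The schedule described in the abstract (full attention in early denoising steps, a head-specific sparse pattern computed once, then reused in later steps) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`full_attention`, `topk_mask`, `run_schedule`), the per-query top-k selection rule, the `keep_ratio` and `full_steps` parameters, and the choice to derive the pattern from the last full-attention step are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_scores(q, k):
    # q, k: (heads, seq, dim) -> scores: (heads, seq, seq)
    return q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])

def full_attention(q, k, v):
    scores = attention_scores(q, k)
    return softmax(scores) @ v, scores

def topk_mask(scores, keep_ratio):
    # Hypothetical selection rule: for every head and every query,
    # keep the keys with the largest scores. Each head gets its own mask.
    n = scores.shape[-1]
    k = max(1, int(round(keep_ratio * n)))
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def masked_attention(q, k, v, mask):
    # Emulates sparse attention with a dense mask. This shows the numerics
    # only; a real speedup needs a kernel that skips the masked pairs.
    scores = np.where(mask, attention_scores(q, k), -np.inf)
    return softmax(scores) @ v

def run_schedule(qkv_per_step, full_steps, keep_ratio):
    # Assumption: the pattern is taken from the last full-attention step.
    if full_steps < 1:
        raise ValueError("need at least one full-attention step")
    mask, outputs, modes = None, [], []
    for step, (q, k, v) in enumerate(qkv_per_step):
        if step < full_steps:
            out, scores = full_attention(q, k, v)
            if step == full_steps - 1:
                mask = topk_mask(scores, keep_ratio)  # computed once
            modes.append("full")
        else:
            out = masked_attention(q, k, v, mask)     # reused as-is
            modes.append("sparse")
        outputs.append(out)
    return outputs, modes, mask
```

Calling `run_schedule` with a list of per-step `(q, k, v)` arrays of shape `(heads, seq, dim)` returns the per-step outputs, the mode used at each step, and the reused boolean mask of shape `(heads, seq, seq)`. Setting `keep_ratio=1.0` reduces the sparse steps to full attention, which gives a quick sanity check.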

Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, Xinchao Wang • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy (Acc) 77.48	352
Mathematical Reasoning	ASDIV	Accuracy 0.8117	280
Mathematical Reasoning	Countdown	Accuracy 25.78	252
Code Generation	HumanEval	Accuracy 40.24	224
Mathematical Reasoning	GSM8K	--	220
Instruction Following	IFEval	Accuracy (IFEval) 57.67	101
Long-context Understanding	LongBench	HotpotQA 50.52	82
Code Generation	HumanEval	pass@1 54.87	50
Long-context Understanding	RULER	Accuracy 90.95	38
Long-context Understanding	RULER 8k	Accuracy 66.16	14

