Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching

About

Diffusion language models (DLMs) generate text through iterative denoising, but inference requires full-sequence attention at every iteration, resulting in substantial redundant computation on masked tokens. Block-wise diffusion can reduce this cost, yet it typically relies on retraining and constrained update orders, limiting its direct applicability to pretrained DLMs. Our token-level analysis reveals pronounced structural locality in DLM inference. Decoding is driven by a small set of prefix-localized active tokens; the influence of distant undecoded context diminishes rapidly, and decoded tokens exhibit stage-wise temporal stability, enabling reuse of intermediate representations except for a brief post-decode transient. Motivated by these observations, we propose Window-Diffusion (source code available at https://github.com/vhicrgit/Window-Diffusion), a window-based token pruning and caching method for inference. We maintain a local computation window that slides rightward as denoising progresses, and partition undecoded tokens into: (i) active tokens that are computed online, (ii) buffer tokens whose KV states are cached and periodically refreshed, and (iii) far-field tokens that are pruned outside the window. Computation is restricted to active and buffer tokens within the window, while far-field tokens are omitted at each stage. Experiments on LLaDA and Dream show that, under matched compute budgets, our method achieves up to 99× inference speedup while largely preserving generation performance.
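The three-way partition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `partition_undecoded`, the parameters `window` and `n_active`, and the rule that the window starts at the first undecoded position are all assumptions made for illustration, and KV caching and refresh scheduling are left out entirely.

```python
from typing import Dict, List


def partition_undecoded(
    decoded: List[bool], window: int, n_active: int
) -> Dict[str, List[int]]:
    """Split undecoded positions into active, buffer, and far-field groups.

    Assumptions (hypothetical, not taken from the paper):
      * the window starts at the first undecoded position and spans
        `window` positions to the right;
      * the first `n_active` undecoded positions inside the window are
        active, the remaining undecoded positions inside it are buffer;
      * every undecoded position right of the window is far-field.
    """
    undecoded = [i for i, done in enumerate(decoded) if not done]
    if not undecoded:
        return {"active": [], "buffer": [], "far_field": []}

    start = undecoded[0]
    end = start + window  # exclusive right edge of the window

    in_window = [i for i in undecoded if i < end]
    return {
        "active": in_window[:n_active],      # computed online
        "buffer": in_window[n_active:],      # would reuse cached KV states
        "far_field": [i for i in undecoded if i >= end],  # pruned
    }


# Positions 0, 1, and 3 are already decoded; the rest are still masked.
decoded = [True, True, False, True, False, False, False, False, False, False]
groups = partition_undecoded(decoded, window=5, n_active=2)

# Decoding position 2 moves the first undecoded position, and hence the
# window, to the right.
decoded[2] = True
groups_after = partition_undecoded(decoded, window=5, n_active=2)
```

In this toy setup, recomputing the partition after each decoding step is what makes the window slide rightward: far-field positions enter the buffer, and buffer positions become active, as the prefix fills in.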

Fengrui Zuo, Zhiwei Ke, Yiming Liu, Wenqi Lou, Chao Wang, Xuehai Zhou • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Code Generation	HumanEval (test)	--	--	612
Code Generation	MBPP (test)	--	--	405
Mathematical Reasoning	GSM8K	Speed Up (x)	5.7	246
Code Generation	MBPP	Accuracy	55.4	89
Code Generation	HumanEval	Accuracy	58.5	51
Mathematical Reasoning	GSM8K (test)	Relative Speedup	5.6	17
Mathematics	MATH	Accuracy	39.2	10
Mathematical Reasoning	MATH (test)	Accuracy	26.2	5

