DPad: Efficient Diffusion Language Models with Suffix Dropout

About

Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a small set of nearby suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This simple design is compatible with existing optimizations such as prefix caching and can be implemented with only a few lines of code. Comprehensive evaluations across multiple benchmarks on LLaDA-1.5 and Dream models demonstrate that DPad delivers up to 61.4× speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference. Our code is available at https://github.com/Crys-Chen/DPad.
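
The two strategies in the abstract can be illustrated with a minimal sketch. It is not taken from the DPad repository: the function name `select_suffix_positions`, the parameters `window` and `dense`, and the gap-doubling rule are all hypothetical stand-ins, chosen only to show one way a fixed-length suffix window and a deterministic, distance-dependent thinning of suffix tokens could be combined before attention.

```python
def select_suffix_positions(block_end, seq_len, window=32, dense=4):
    """Return the absolute indices of suffix tokens kept for attention.

    Hypothetical illustration only; the schedule below is an assumption,
    not the rule used in the DPad paper or code.

    block_end: index of the first suffix token (end of the current block).
    seq_len:   total sequence length.
    window:    sliding-window length; suffix tokens beyond it are dropped.
    dense:     number of nearest suffix tokens that are always kept.
    """
    # (i) Sliding window: ignore every suffix token past block_end + window.
    window_end = min(seq_len, block_end + window)

    kept = []
    offset = 0   # distance from the start of the suffix
    stride = 1   # gap to the next kept token
    while block_end + offset < window_end:
        kept.append(block_end + offset)
        # (ii) Distance-dependent thinning (assumed rule): once the dense
        # region is covered, double the gap after each kept token, so
        # far-away tokens are retained ever more sparsely.
        if offset + 1 >= dense:
            stride *= 2
        offset += stride
    return kept


if __name__ == "__main__":
    # Suffix starts at position 10 of a 100-token sequence.
    print(select_suffix_positions(block_end=10, seq_len=100))
```

With the default arguments, only 7 of the 32 in-window suffix positions are kept, and the 58 positions beyond the window are never attended to. In an actual model, the returned indices would be used to gather the corresponding suffix tokens before the attention call.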

Xinhua Chen, Sitao Huang, Cong Guo, Chiyue Wei, Yintao He, Jianyi Zhang, Hai "Helen" Li, Yiran Chen • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Mathematical Reasoning	ASDIV	Accuracy	0.8095	280
Mathematical Reasoning	Countdown	Accuracy	25.39	252
Code Generation	HumanEval	Accuracy	39.63	224
Mathematical Reasoning	GSM8K	--	--	220
Instruction Following	IFEval	Accuracy (IFEval)	57.86	101

