
Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes

About

While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose Deletion-Insertion Diffusion language models (DID), which rigorously formulate token deletion and insertion as discrete diffusion processes, replacing the masking and unmasking processes in current MDLMs. DID improves training and inference efficiency by eliminating two major sources of computational overhead in MDLMs: computation on non-informative tokens, namely 1) the &lt;MASK&gt; tokens inherent to the paradigm, and 2) the &lt;PAD&gt; tokens introduced in variable-length settings. Furthermore, DID offers greater flexibility by 1) natively supporting variable-length sequences without requiring fixed-length padding, and 2) providing an intrinsic self-correction mechanism during generation, since insertion dynamically adjusts token positions. To train DID, we design a score-based approach that assigns scores to token insertion operations and derive appropriate training objectives. The objectives involve subsequence counting problems, which we efficiently solve via a parallelized dynamic programming algorithm. Our experiments across fixed- and variable-length settings demonstrate the advantage of DID over MDLM baselines and existing insertion-based LMs in terms of modeling performance, sampling quality, and training/inference speed, without any hyperparameter tuning.
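The abstract's training objectives involve subsequence counting. The paper's parallelized algorithm is not detailed here, but the underlying problem, counting the number of distinct ways a short sequence occurs as a (not necessarily contiguous) subsequence of a longer one, admits a classic sequential dynamic program. A minimal illustrative sketch:

```python
def count_subsequences(y, x):
    """Number of distinct ways x occurs as a subsequence of y.

    Illustrative textbook DP, not the paper's parallelized algorithm:
    dp[j] = number of ways x[:j] occurs as a subsequence of the prefix
    of y processed so far, with dp[0] = 1 for the empty subsequence.
    """
    m = len(x)
    dp = [1] + [0] * m
    for tok in y:
        # Scan j backwards so each token of y extends each partial
        # occurrence at most once.
        for j in range(m, 0, -1):
            if x[j - 1] == tok:
                dp[j] += dp[j - 1]
    return dp[m]
```

For example, `count_subsequences("rabbbit", "rabbit")` counts three occurrences, one for each choice of which two of the three b's to use.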

Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, Jiacheng Sun • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | PTB | Perplexity 87.09 | 1034 |
| Language Modeling | WikiText | Perplexity 28.35 | 732 |
| Language Modeling | LAMBADA | -- | 268 |
| Language Modeling | OpenWebText | Perplexity 85.35 | 91 |
| Language Modeling | arXiv | Perplexity 61.77 | 55 |
| Language Modeling | Pubmed | Perplexity 38.71 | 38 |
| Language Modeling | AG-News | Perplexity 48.84 | 36 |
| Conditional Generation | OWT | Perplexity 19.99 | 24 |
| Language Modeling | LM1B | Perplexity 58.05 | 22 |
| Variable-length Language Modeling | Stories | Perplexity 21.07 | 12 |

(Showing 10 of 11 rows.)
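The benchmark results above report perplexity (PPL), the standard language-modeling metric: the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch of the computation:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_log_probs: natural-log probabilities the model assigned to
    each token in the evaluation text.
    """
    n = len(token_log_probs)
    nll = -sum(token_log_probs) / n
    return math.exp(nll)
```

For instance, a model that assigns probability 1/4 to every token has perplexity 4, as if it were choosing uniformly among four options at each step.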
