Understanding and Accelerating the Training of Masked Diffusion Language Models

About

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. We therefore ask: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and propose a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our recipe reach the same validation negative log-likelihood (NLL) up to $\sim4\times$ faster than standard training on the One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
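
As a rough illustration of what bell-shaped time sampling could look like, the sketch below contrasts uniform draws of the masking time t in [0, 1] with draws from a symmetric Beta distribution that concentrates mass around t = 0.5. The abstract does not specify the distribution the authors use, so the Beta(2, 2) choice, the function names, and the toy masking step are all assumptions made for illustration, not the paper's recipe.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol for this toy example


def sample_time_uniform(rng):
    """Baseline: draw the masking time t uniformly from [0, 1)."""
    return rng.random()


def sample_time_bell(rng, concentration=2.0):
    """Assumed bell-shaped alternative: Beta(a, a) with a > 1.

    The density is symmetric and peaks at t = 0.5, so intermediate
    masking ratios are drawn more often than extreme ones.
    """
    return rng.betavariate(concentration, concentration)


def mask_tokens(tokens, t, rng):
    """Mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def central_fraction(sampler, rng, n=10000):
    """Fraction of n draws that land in the middle band [0.25, 0.75]."""
    hits = sum(1 for _ in range(n) if 0.25 <= sampler(rng) <= 0.75)
    return hits / n


rng = random.Random(0)
tokens = "the cat sat on the mat".split()
t = sample_time_bell(rng)
masked = mask_tokens(tokens, t, rng)
print(f"t = {t:.3f}", masked)
print("uniform central mass:", central_fraction(sample_time_uniform, rng))
print("bell central mass:   ", central_fraction(sample_time_bell, rng))
```

In a real training loop, the sampled t would set the masking ratio for each sequence before the denoising loss is computed. Any loss reweighting needed to keep the objective consistent with the NLL bound is outside the scope of this sketch.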

Chunsan Hong, Sanghyun Lee, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Yuki Mitsufuji, Seungryong Kim, Jong Chul Ye • 2026

Related benchmarks

Task	Dataset	Result
Instruction Following	AlpacaEval 2.0	Win Rate 78.27	752
Commonsense Reasoning	WinoGrande	Accuracy 50.36	453
Question Answering	OpenBookQA	Accuracy 24.6	319
Text Generation	OpenWebText	Perplexity 33.89	187
Commonsense Inference	HellaSwag	Accuracy 39.84	171
Social Commonsense Reasoning	SIQA	Accuracy 42.07	118
Language Modeling	LAMBADA	Accuracy 17.89	114
Physical Commonsense Reasoning	PIQA	Accuracy (PIQA) 59.3	99
Language Modeling	LAMBADA zero-shot (test)	--	44
Language Modeling	PTB zero-shot	Perplexity 98.16	35

Showing 10 of 18 rows
