
Understanding and Accelerating the Training of Masked Diffusion Language Models

About

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. We therefore ask: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our recipe reach the same validation negative log-likelihood (NLL) up to $\sim 4\times$ faster than standard training on the One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
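The idea of bell-shaped time sampling can be illustrated with a small sketch. The abstract does not specify the distribution, so everything below is an assumption for illustration: the masking time t is drawn from a symmetric Beta(a, a) with a > 1 (peaked at 0.5) instead of the uniform distribution, and each token is then masked independently with probability t. The names `sample_time_bell`, `mask_tokens`, `MASK`, and the `concentration` parameter are hypothetical, and any loss reweighting the authors may use is not shown.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def sample_time_uniform(rng):
    """Baseline: masking time t drawn uniformly from [0, 1)."""
    return rng.random()


def sample_time_bell(rng, concentration=2.0):
    """Assumed bell-shaped alternative: Beta(a, a) with a > 1.

    The density is symmetric and peaked at t = 0.5, so mid-range
    masking ratios are sampled more often than extreme ones.
    """
    return rng.betavariate(concentration, concentration)


def mask_tokens(tokens, t, rng):
    """Replace each token by MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "the cat sat on the mat".split()
    t = sample_time_bell(rng)
    print(f"t = {t:.3f}", mask_tokens(tokens, t, rng))
```

With `concentration=2.0`, roughly 69% of sampled times fall in [0.25, 0.75], compared with 50% under uniform sampling; larger values of `concentration` narrow the bell further.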

Chunsan Hong, Sanghyun Lee, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Yuki Mitsufuji, Seungryong Kim, Jong Chul Ye • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Instruction Following | AlpacaEval 2.0 | Win Rate 78.27 | 722 |
| Commonsense Reasoning | WinoGrande | Accuracy 50.36 | 453 |
| Question Answering | OpenBookQA | Accuracy 24.6 | 305 |
| Text Generation | OpenWebText | Perplexity 33.89 | 142 |
| Commonsense Inference | HellaSwag | Accuracy 39.84 | 123 |
| Social Commonsense Reasoning | SIQA | Accuracy 42.07 | 112 |
| Language Modeling | LAMBADA | Accuracy 17.89 | 103 |
| Physical Commonsense Reasoning | PIQA | Accuracy (PIQA) 59.3 | 99 |
| Language Modeling | LAMBADA zero-shot (test) | -- | 44 |
| Language Modeling | PTB zero-shot | Perplexity 98.16 | 25 |
Showing 10 of 18 rows
