
Diffusion Language Models are Super Data Learners

About

Under strictly controlled pre-training settings, we observe a Crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; input or parameter noise improves AR under data constraints but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained with strictly matched settings. In addition, a 1B-parameter DLM achieves > 56% accuracy on HellaSwag and > 33% on MMLU using only 1B tokens, without any special tricks, simply by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.
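The "built-in Monte Carlo augmentation" claim can be illustrated with a minimal sketch: a masked-diffusion training step samples a fresh masking ratio and a fresh set of masked positions on every pass, so repeating the same data yields new training views, whereas the AR factorization of a sequence is fixed. The function below is a hypothetical illustration, not the paper's actual training code.

```python
import random

MASK = "<mask>"

def diffusion_corrupt(tokens, rng):
    """One masked-diffusion training view (illustrative sketch).

    Sample a masking ratio t ~ U(0, 1), then mask each position
    independently with probability t. A DLM would be trained to
    reconstruct the masked targets from bidirectional context.
    """
    t = rng.random()
    corrupted, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < t:
            corrupted.append(MASK)
            targets.append((i, tok))  # predict these from both sides
        else:
            corrupted.append(tok)
    return corrupted, targets

rng = random.Random(0)
seq = ["the", "cat", "sat", "on", "the", "mat"]
# Revisiting the same sequence across epochs produces fresh corruptions,
# i.e. Monte Carlo samples of prediction tasks over the same data:
views = [diffusion_corrupt(seq, rng) for _ in range(3)]
```

Under this view, each epoch over a fixed corpus gives the model a different any-order prediction problem per sequence, which is one way to read why DLMs extract more signal from repeated data than AR models do.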

Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, Michael Qizhe Shieh • 2025

Related benchmarks

Task                            | Dataset                  | Result                | Rank
Code Generation                 | HumanEval                | -                     | 850
Language Understanding         | MMLU                     | Accuracy 48.8         | 756
Physical Commonsense Reasoning  | PIQA                     | Accuracy 67.8         | 329
Instruction Following           | IFEval                   | Accuracy (0-100) 43.3 | 292
Science Question Answering      | ARC-C                    | Accuracy 56.6         | 127
Code Generation                 | MBPP                     | Accuracy 29           | 120
Commonsense Reasoning           | HellaSwag                | Accuracy 44.4         | 12
Code Generation                 | LiveCodeBench v1 (test)  | Accuracy 31.9         | 9

Other info

GitHub
