Diffusion Language Models are Super Data Learners
About
Under strictly controlled pre-training settings, we observe a Crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, arrives earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; adding input or parameter noise improves AR models under data constraints but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained under strictly matched settings. In addition, a 1B-parameter DLM achieves > 56% accuracy on HellaSwag and > 33% on MMLU using only 1B tokens, with no special tricks, simply by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.
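To make factors (1) and (3) concrete, here is a minimal PyTorch sketch of a masked-diffusion training objective. It is an illustrative assumption, not the authors' implementation; `MASK_ID` and the `model` interface are hypothetical. The point it shows: each pass over the same data re-samples a masking ratio and mask pattern per sequence, so repeated epochs are fresh corruptions (Monte Carlo augmentation) and the bidirectional denoiser is trained to predict tokens in any order rather than strictly left to right.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical mask-token id (assumption, not from the paper)

def diffusion_lm_loss(model, tokens):
    """One masked-diffusion training step over clean token ids [batch, seq]."""
    b, t = tokens.shape
    # Re-sample a corruption level per sequence on every step, so each epoch
    # over the same data sees a different mask pattern (built-in augmentation).
    mask_ratio = torch.rand(b, 1, device=tokens.device)
    mask = torch.rand(b, t, device=tokens.device) < mask_ratio
    noised = tokens.masked_fill(mask, MASK_ID)
    # A bidirectional denoiser predicts every position in parallel (any-order
    # modeling), instead of only the next token as in AR training.
    logits = model(noised)  # [b, t, vocab]
    # Loss is computed only on the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```

By contrast, an AR model with the same data sees an identical next-token prediction target on every epoch, which is one way to read why data repetition helps DLMs more.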
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | -- | -- | 850 |
| Language Understanding | MMLU | Accuracy | 48.8 | 756 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 67.8 | 329 |
| Instruction Following | IFEval | Accuracy (0-100) | 43.3 | 292 |
| Science Question Answering | ARC-C | Accuracy | 56.6 | 127 |
| Code Generation | MBPP | Accuracy | 29 | 120 |
| Commonsense Reasoning | HellaSwag | Accuracy | 44.4 | 12 |
| Code Generation | LiveCodeBench v1 (test) | Accuracy | 31.9 | 9 |