Introspective Diffusion Language Models
About
Diffusion language models (DLMs) promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We trace this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, which measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Motivated by this observation, we introduce the Introspective Diffusion Language Model (I-DLM), a paradigm that retains diffusion-style parallel decoding while inheriting the introspective consistency of AR training. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. From a systems standpoint, we build the I-DLM inference engine on AR-inherited optimizations and further customize it with a stationary-batch scheduler. To the best of our knowledge, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks. It reaches 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, exceeding LLaDA-2.1-mini (16B) by more than 26 and 15 points, respectively. Beyond quality, I-DLM is designed for the growing demand for high-concurrency serving, delivering about 3x higher throughput than prior state-of-the-art DLMs.
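As a rough illustration of the introspective acceptance rate described in the abstract, the sketch below counts how often a model would pick its own previously generated token again on a second pass. The abstract does not give the exact acceptance rule, so the greedy (argmax) criterion, the function name `introspective_acceptance_rate`, and the toy logits are all assumptions made for illustration, not the paper's definition.

```python
from typing import Sequence


def introspective_acceptance_rate(
    generated: Sequence[int],
    rescored_logits: Sequence[Sequence[float]],
) -> float:
    """Fraction of previously generated tokens the model would pick again.

    Assumption (not taken from the paper): a token counts as "accepted" when
    it is the argmax of the model's own logits at that position on a second
    pass over the sequence.
    """
    if len(generated) != len(rescored_logits):
        raise ValueError("one logit row is required per generated token")
    if not generated:
        return 0.0
    accepted = 0
    for token, logits in zip(generated, rescored_logits):
        # Index of the highest-scoring vocabulary entry at this position.
        best = max(range(len(logits)), key=logits.__getitem__)
        accepted += int(best == token)
    return accepted / len(generated)


# Toy data: a vocabulary of 3 tokens and 4 generated positions.
generated = [2, 0, 1, 1]
rescored_logits = [
    [0.1, 0.2, 3.0],  # argmax is 2, agrees with the generated token
    [2.5, 0.3, 0.1],  # argmax is 0, agrees
    [1.9, 0.4, 0.2],  # argmax is 0, disagrees with generated token 1
    [0.0, 1.2, 0.7],  # argmax is 1, agrees
]
rate = introspective_acceptance_rate(generated, rescored_logits)
print(rate)
```

Under this reading, a model with perfect introspective consistency would score 1.0; a sampling-based acceptance rule, as used in speculative decoding, would be another plausible choice.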
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | IFEval | IFEval Accuracy | 84.7 | 836 |
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 97.6 | 543 |
| Mathematical Reasoning | AIME 2024 | Accuracy | 83.3 | 479 |
| Mathematical Reasoning | GSM8K | Accuracy (Acc) | 95 | 337 |
| General Knowledge | MMLU | MMLU General Knowledge Accuracy | 82.4 | 307 |
| Knowledge Reasoning | MMLU-Pro | Accuracy | 79.7 | 120 |
| Reasoning | ARC Challenge | Accuracy | 96.8 | 100 |
| Code Generation | LiveCodeBench v6 | Accuracy | 57.1 | 75 |
| Knowledge Reasoning | MMLU | MMLU Knowledge Reasoning Accuracy | 86.8 | 73 |
| Knowledge Reasoning | GPQA | Accuracy | 58.7 | 18 |