From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs
About
Diffusion Language Models (DLMs) enable fast generation, yet training large DLMs from scratch is costly. As a practical shortcut, adapting off-the-shelf Auto-Regressive (AR) model weights into a DLM can quickly equip the DLM with strong long-context generation capabilities. Prior adaptation attempts either modify logits or randomly grow attention masks toward Full-Sequence diffusion, or simply transplant AR weights into a Block-Diffusion recipe, leaving two key questions unaddressed: what is the final destination of adaptation, and how can adaptation be done better? For its manifold benefits, we reframe the whole AR-to-DLM adaptation under the Block-Diffusion paradigm, transitioning from block size 1 to the final Block-Diffusion state. Concretely, the adaptation pathway has three components: a context-causal path in which causal attention is kept in the prefix, an efficient parallel adaptation procedure in which AR guidance is maintained, and a gradual increase of the generation block size for a smoother transition. Built on these components, the adaptation proves competitive across models of different scales. With better adaptation, we propose NBDiff-7B, which inherits long-context modeling and reasoning capabilities and achieves state-of-the-art performance among 7B-class DLMs. Code: https://github.com/YuchuanTian/NBDiff.
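To make the context-causal idea concrete, the sketch below builds one possible attention mask for a causal prefix followed by a single generation block. It is an illustrative reading of the abstract, not the NBDiff implementation: the function names `context_causal_block_mask` and `causal_mask` are hypothetical, and the assumption is that prefix tokens attend causally while tokens inside the generation block attend to the whole prefix and to each other.

```python
def causal_mask(n):
    """Standard AR mask: query i may attend to key j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def context_causal_block_mask(prefix_len, block_size):
    """Hypothetical mask: causal over the prefix, bidirectional inside
    the trailing generation block (an assumption, not the paper's code)."""
    n = prefix_len + block_size
    mask = causal_mask(n)
    # Tokens in the generation block see every other token in that block.
    for i in range(prefix_len, n):
        for j in range(prefix_len, n):
            mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print a small example as rows of 0/1 for inspection.
    for row in context_causal_block_mask(prefix_len=3, block_size=2):
        print(" ".join(str(int(v)) for v in row))
```

Under this assumption, a block size of 1 yields exactly the ordinary causal mask, which is one way to see why block size 1 can serve as the starting point of the adaptation and why enlarging the block gradually changes the attention pattern only a little at each step.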
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | Pass@1 | 89 | 850 |
| Language Understanding | MMLU | Accuracy | 82.9 | 756 |
| Mathematical Reasoning | MATH | Accuracy | 84 | 643 |
| Mathematical Reasoning | MATH | Accuracy | 46 | 535 |
| Mathematical Reasoning | GSM8K | Accuracy (GSM8K) | 91 | 358 |
| Instruction Following | IFEval | Accuracy (0-100) | 60.8 | 292 |
| Code Generation | MBPP | Pass@1 | 87.6 | 175 |
| Code Generation | MBPP | Accuracy (%) | 55.8 | 146 |
| Logical Reasoning | BBH | Accuracy | 77.3 | 93 |
| Language Understanding | MMLU-Pro | Accuracy | 71.9 | 70 |