LLaDA2.0: Scaling Up Diffusion Language Models to 100B
About
This paper presents LLaDA2.0, a family of discrete diffusion large language models (dLLMs) scaled up to 100B total parameters through systematic conversion from auto-regressive (AR) models, establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel three-phase, block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Together with post-training alignment via SFT and DPO, this yields LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.
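To make the block-diffusion decoding idea concrete, here is a minimal, hedged sketch of block-wise parallel decoding in the mask-predict style commonly used by discrete diffusion LMs: text is generated left-to-right one block at a time, and within each block, masked positions are filled in parallel over a few refinement steps, unmasking the most confident predictions first. The `fake_model` predictor and all function names are hypothetical illustrations, not LLaDA2.0's actual interface.

```python
# Hedged sketch of block-wise parallel (mask-predict style) decoding.
# `fake_model` is a stand-in for a real diffusion LM; names are hypothetical.
import random

MASK = "<mask>"

def fake_model(tokens):
    """Stand-in predictor: returns {position: (token, confidence)} for each mask."""
    return {i: (f"tok{i}", random.random())
            for i, t in enumerate(tokens) if t == MASK}

def decode_block(tokens, steps=4):
    """Iteratively unmask the most confident positions (mask-predict style)."""
    for _ in range(steps):
        preds = fake_model(tokens)
        if not preds:
            break  # block fully decoded
        # Unmask roughly half of the remaining masked positions per step.
        k = max(1, len(preds) // 2)
        best = sorted(preds.items(), key=lambda kv: -kv[1][1])[:k]
        for i, (tok, _conf) in best:
            tokens[i] = tok
    return tokens

def block_diffusion_decode(prompt, block_size=4, n_blocks=2):
    """Generate block by block; positions within a block decode in parallel."""
    out = list(prompt)
    for _ in range(n_blocks):
        out.extend([MASK] * block_size)  # append a fresh masked block
        out = decode_block(out)
    return out
```

The block size here plays the role tuned by the paper's warm-up and decay phases: a larger block exposes more parallelism per step, while a compact block keeps each denoising pass easier, which is why the decay phase reverts to compact blocks for deployment.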
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval (test) | -- | -- | 444 |
| Code Generation | MBPP (test) | -- | -- | 276 |
| Reasoning | HellaSwag (HS) | Accuracy | 84.97 | 142 |
| Reasoning | PIQA | Accuracy | 96.5 | 133 |
| Text-to-SQL | Spider | Exec Acc (All) | 82.49 | 57 |
| Function-level Code Generation | HumanEval+ augmented (test) | Pass@1 | 87.8 | 46 |
| Function-level Code Generation | MBPP+ augmented (test) | Pass@1 | 79.6 | 45 |
| Code Generation | BigCodeBench-Completion Full | Pass@1 | 41.6 | 41 |
| Coding | HumanEval+ | Pass@1 | 88.41 | 31 |
| Knowledge | MMLU-Pro | Score | 74.79 | 30 |