MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Scaling of Diffusion Language Models

About

Masked diffusion models (MDM) exhibit superior generalization when trained with a Partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, we find that the functional form of the subtokenizer significantly increases the cross-entropy loss in the objective when paired with commonly used Byte-Pair-Encoding (BPE) tokenizers. Second, we lack tools to guide the choice of the token-granularity hyperparameter in the subtokenizer. To address these limitations, we analyze the optimal design of the subtokenizer that minimizes the MDM-Prime training objective and develop MDM-Prime-v2, a masked diffusion language model that incorporates Binary Encoding and Index Shuffling. Our analysis characterizes how token granularity and sub-token entropy influence the training objective and downstream performance, providing principled criteria for subtokenizer design. When scaled to 1.1B parameters, MDM-Prime-v2 achieves higher average zero-shot accuracy across eight commonsense reasoning benchmarks than similar-sized baselines, including GPT-Neo, OPT, Pythia, Bloom, SMDM, and TinyLLaMA.
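The abstract does not spell out how the subtokenizer is constructed, so the following is only a minimal sketch of one plausible reading of "Binary Encoding and Index Shuffling": each token index is first remapped by a fixed random permutation, then written as a fixed-length sequence of binary sub-tokens. The function name `build_subtokenizer`, the `seed` parameter, the most-significant-bit-first ordering, and the use of a uniform random permutation are all assumptions made for illustration, not details taken from the paper.

```python
import random


def build_subtokenizer(vocab_size, seed=0):
    """Hypothetical binary subtokenizer with index shuffling (illustrative only).

    Returns (encode, decode, num_bits), where encode maps a token id to a
    list of num_bits binary sub-tokens and decode inverts that mapping.
    """
    # Number of bits needed to represent every index in [0, vocab_size).
    num_bits = max(1, (vocab_size - 1).bit_length())

    # Assumed form of index shuffling: a fixed random permutation of token ids.
    rng = random.Random(seed)
    perm = list(range(vocab_size))
    rng.shuffle(perm)  # perm[token_id] is the shuffled index

    # Inverse permutation, used when decoding.
    inverse = [0] * vocab_size
    for token_id, shuffled in enumerate(perm):
        inverse[shuffled] = token_id

    def encode(token_id):
        shuffled = perm[token_id]
        # Binary encoding, most significant bit first.
        return [(shuffled >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]

    def decode(bits):
        shuffled = 0
        for bit in bits:
            shuffled = (shuffled << 1) | bit
        # Bit patterns >= vocab_size are not valid codes and raise IndexError.
        return inverse[shuffled]

    return encode, decode, num_bits


# Example with a GPT-2-sized BPE vocabulary (50257 entries), chosen for illustration.
encode, decode, num_bits = build_subtokenizer(50257, seed=0)
bits = encode(42)
assert decode(bits) == 42
```

Under this sketch, a 50257-entry vocabulary yields 16 binary sub-tokens per token, and a partial masking scheme could then mask any subset of those 16 positions rather than the whole token at once.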

Chen-Hao Chao, Wei-Fang Sun, Junwei Quan, Chun-Yi Lee, Rahul G. Krishnan • 2026

Related benchmarks

Task                    Dataset                  Metric      Result  Rank
Common Sense Reasoning  BoolQ                    Accuracy    62.05   240
Commonsense Reasoning   OBQA                     Accuracy    34      187
Commonsense Reasoning   SocialIQA                Accuracy    42.02   158
Commonsense Reasoning   ARC-E                    Accuracy    47.81   152
Language Modeling       PTB (val)                Perplexity  20.26   107
Language Modeling       LM1B (val)               Perplexity  16.57   67
Language Modeling       WikiText (val)           Perplexity  12.51   62
Language Modeling       OpenWebText (OWT) (val)  Perplexity  7.77    42
Language Modeling       LAMBADA (val)            Perplexity  12.37   39
Language Modeling       AG News (val)            Perplexity  27.79   36
Showing 10 of 15 rows
