Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models

About

Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit this granularity and observe that reliable predictions often emerge as contiguous high-confidence spans, suggesting that the unit of parallel commitment can be larger than a single token. We first group adjacent high-confidence candidates into confidence-induced clusters (CICs) as span-level update units. We then use self-attention maps from the same forward pass to estimate inter-cluster dependencies, enabling conflict-aware selection of mutually compatible CICs for parallel commitment. This yields CLAD (Cluster-Level Attention-Guided Decoding), a training-free cluster-level decoder for MDLMs. Experiments on LLaDA and Dream model families across four reasoning and code-generation benchmarks show that CLAD achieves 1.77x to 8.47x speedups over Vanilla decoding while maintaining broadly comparable task accuracy in most settings.
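The two steps described in the abstract (grouping adjacent high-confidence candidates into clusters, then selecting mutually compatible clusters using attention) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the fixed confidence threshold `tau`, the symmetric mean-attention dependency score, the greedy selection rule, the cutoff `max_dep`, and all function names.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open span [start, end) of token positions


def confidence_clusters(conf: List[float], masked: List[bool], tau: float) -> List[Span]:
    """Group adjacent masked positions with confidence >= tau into spans.

    Assumption: a single fixed threshold tau defines a "high-confidence" candidate.
    """
    clusters: List[Span] = []
    start = None
    for i, (c, m) in enumerate(zip(conf, masked)):
        if m and c >= tau:
            if start is None:
                start = i  # open a new span
        elif start is not None:
            clusters.append((start, i))  # close the current span
            start = None
    if start is not None:
        clusters.append((start, len(conf)))
    return clusters


def cluster_dependency(attn: List[List[float]], a: Span, b: Span) -> float:
    """Symmetric mean attention between the positions of two clusters.

    Assumption: attn[i][j] is the (already head-averaged) attention from i to j.
    """
    total, count = 0.0, 0
    for i in range(a[0], a[1]):
        for j in range(b[0], b[1]):
            total += attn[i][j] + attn[j][i]
            count += 2
    return total / count


def select_compatible(clusters: List[Span], conf: List[float],
                      attn: List[List[float]], max_dep: float) -> List[Span]:
    """Greedily keep clusters, most confident first, that do not conflict
    (dependency > max_dep) with any cluster already kept."""
    def mean_conf(s: Span) -> float:
        return sum(conf[s[0]:s[1]]) / (s[1] - s[0])

    kept: List[Span] = []
    for c in sorted(clusters, key=mean_conf, reverse=True):
        if all(cluster_dependency(attn, c, k) <= max_dep for k in kept):
            kept.append(c)
    return sorted(kept)


# Toy example: six positions, position 4 is already unmasked.
conf = [0.95, 0.90, 0.20, 0.85, 0.99, 0.97]
masked = [True, True, True, True, False, True]
attn = [[0.0] * 6 for _ in range(6)]
attn[3][0], attn[3][1] = 0.6, 0.5  # position 3 attends strongly to positions 0 and 1

clusters = confidence_clusters(conf, masked, tau=0.8)
selected = select_compatible(clusters, conf, attn, max_dep=0.1)
```

In this toy run, the span covering positions 0 and 1 and the single-position span at 3 depend on each other through attention, so only the more confident of the two is committed in the same step, alongside the independent span at position 5. A real decoder would take `conf` and `attn` from the model's forward pass and repeat the procedure at every denoising step.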

Heqiang Qi, Wei Huang, Mingyuan Bai, Xiangming Meng • 2026

Related benchmarks

Task	Dataset	Result
Code Generation	HumanEval 0-shot	Accuracy 54.27	100
Mathematical Reasoning	GSM8k 5-shot	Accuracy 81.65	82
Code Generation	MBPP 3-shot	Accuracy 54.2	57
Math Reasoning	MATH 4-shot	Accuracy 37.34	45
Reasoning	GSM8k 5-shot	Accuracy 77.79	12
