Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models
About
Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit this granularity and observe that reliable predictions often emerge as contiguous high-confidence spans, suggesting that the unit of parallel commitment can be larger than a single token. We first group adjacent high-confidence candidates into confidence-induced clusters (CICs) as span-level update units. We then use self-attention maps from the same forward pass to estimate inter-cluster dependencies, enabling conflict-aware selection of mutually compatible CICs for parallel commitment. This yields CLAD (Cluster-Level Attention-Guided Decoding), a training-free cluster-level decoder for MDLMs. Experiments on LLaDA and Dream model families across four reasoning and code-generation benchmarks show that CLAD achieves 1.77x to 8.47x speedups over Vanilla decoding while maintaining broadly comparable task accuracy in most settings.
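The two stages described in the abstract (grouping adjacent high-confidence positions into clusters, then keeping only clusters with weak mutual attention) can be illustrated with a small sketch. This is not the authors' implementation: the confidence threshold `tau`, the symmetric mean-attention dependency score, the `max_dep` cutoff, and the greedy selection order are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) with end exclusive


def confidence_clusters(conf: Sequence[float], masked: Sequence[bool],
                        tau: float = 0.9) -> List[Span]:
    """Group adjacent masked positions whose confidence is >= tau into spans.

    `tau` is an assumed threshold, not a value taken from the paper.
    """
    clusters, start = [], None
    for i, (c, m) in enumerate(zip(conf, masked)):
        ok = m and c >= tau
        if ok and start is None:
            start = i                      # a new span opens here
        elif not ok and start is not None:
            clusters.append((start, i))    # the open span closes before i
            start = None
    if start is not None:
        clusters.append((start, len(conf)))
    return clusters


def cluster_dependency(attn: Sequence[Sequence[float]], a: Span, b: Span) -> float:
    """Assumed dependency score: mean attention between two spans, both directions."""
    total, count = 0.0, 0
    for i in range(*a):
        for j in range(*b):
            total += attn[i][j] + attn[j][i]
            count += 2
    return total / count


def select_compatible(clusters: List[Span], conf: Sequence[float],
                      attn: Sequence[Sequence[float]],
                      max_dep: float = 0.2) -> List[Span]:
    """Greedily keep clusters (most confident first) that do not conflict
    with any cluster already kept. `max_dep` is an assumed cutoff."""
    def mean_conf(span: Span) -> float:
        return sum(conf[span[0]:span[1]]) / (span[1] - span[0])

    chosen: List[Span] = []
    for c in sorted(clusters, key=mean_conf, reverse=True):
        if all(cluster_dependency(attn, c, k) <= max_dep for k in chosen):
            chosen.append(c)
    return sorted(chosen)


# Toy example: 7 masked positions, three high-confidence spans.
conf = [0.95, 0.97, 0.30, 0.92, 0.99, 0.50, 0.91]
masked = [True] * 7
attn = [[0.0] * 7 for _ in range(7)]
attn[6][0] = attn[6][1] = 0.5   # position 6 attends strongly to span (0, 2)

clusters = confidence_clusters(conf, masked)
selected = select_compatible(clusters, conf, attn)
print(clusters)   # [(0, 2), (3, 5), (6, 7)]
print(selected)   # [(0, 2), (3, 5)]
```

In this toy run, the span at position 6 is held back because its assumed dependency score with the already selected span (0, 2) is 0.25, above the 0.2 cutoff; it would be reconsidered at a later denoising step once the tokens it depends on are committed.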
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval 0-shot | Accuracy | 54.27 | 69 |
| Mathematical Reasoning | GSM8k 5-shot | Accuracy | 81.65 | 54 |
| Math Reasoning | MATH 4-shot | Accuracy | 37.34 | 33 |
| Code Generation | MBPP 3-shot | Accuracy | 54.2 | 33 |
| Reasoning | GSM8k 5-shot | Accuracy | 77.79 | 12 |