Empirical Analysis of Decoding Biases in Masked Diffusion Models

About

Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under-explored. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals its Shallow Structure-Aware, Deep Content-Focused attention mechanism: shallow layers utilize floating tokens to build a global structural framework, while deeper layers allocate more capability toward capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in-context learning capabilities of MDMs, allowing them to double the performance compared to ARMs in knowledge-intensive tasks. All codes are available at https://github.com/NEUIR/Uncode.

Pengcheng Huang, Tianming Liu, Zhenghao Liu, Yukun Yan, Shuo Wang, Tong Xiao, Zulong Chen, Maosong Sun• 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy81.5	1398
Reasoning	BBH	--	726
Code Generation	MBPP	Accuracy46.2	165
Planning	Sudoku	Accuracy83.6	129
Planning	Countdown	Accuracy42.4	89
Mathematical Reasoning	MATH500	Accuracy46.8	57
Scientific Reasoning	GPQA	Accuracy28.8	55
Reasoning	BBH	Score54.03	36
Multi-task Knowledge	MMLU-Pro	MMLU-Pro Score33.9	33
Truthfulness	TruthfulQA	Truthfulness Score41.78	16

Showing 10 of 19 rows

Other info

Follow for update

@wizwand_team Discord