
Empirical Analysis of Decoding Biases in Masked Diffusion Models

About

Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under-explored. This paper investigates attention behavior in MDMs and reveals the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals a Shallow Structure-Aware, Deep Content-Focused attention mechanism: shallow layers use floating tokens to build a global structural framework, while deeper layers allocate more capacity to capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in-context learning capabilities of MDMs, allowing them to achieve double the performance of ARMs on knowledge-intensive tasks. All code is available at https://github.com/NEUIR/Uncode.
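The contrast between a fixed attention sink and a floating anchor can be illustrated with a toy sketch. This is not the paper's analysis code: the function names, the "most-attended key position" definition of an anchor, and the example matrices are all assumptions made for illustration. Each matrix is indexed as attention[query][key], with one matrix per step.

```python
from typing import List

Matrix = List[List[float]]  # attention[query][key]; each row sums to 1


def received_attention(attn: Matrix) -> List[float]:
    """Average attention each key position receives over all queries."""
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [sum(row[k] for row in attn) / n_queries for k in range(n_keys)]


def anchor_position(attn: Matrix) -> int:
    """Index of the key position receiving the most attention (hypothetical anchor definition)."""
    mass = received_attention(attn)
    return max(range(len(mass)), key=mass.__getitem__)


def anchors_over_steps(attn_per_step: List[Matrix]) -> List[int]:
    """Anchor position at each step."""
    return [anchor_position(a) for a in attn_per_step]


def is_floating(anchors: List[int]) -> bool:
    """True if the anchor position changes at least once across steps."""
    return len(set(anchors)) > 1


# Made-up toy data (3 tokens, 2 steps), not measurements from the paper.
fixed_sink = [
    [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.6, 0.2, 0.2]],
    [[0.9, 0.05, 0.05], [0.7, 0.1, 0.2], [0.8, 0.1, 0.1]],
]
floating = [
    [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.6, 0.2, 0.2]],
    [[0.1, 0.1, 0.8], [0.2, 0.1, 0.7], [0.1, 0.2, 0.7]],
]

print(anchors_over_steps(fixed_sink))  # anchor stays at one position
print(anchors_over_steps(floating))    # anchor moves between steps
```

In this sketch the first example keeps its anchor on the same token at every step, as a fixed sink would, while the second moves its anchor from one token to another, which is the kind of shift the abstract describes across denoising steps.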

Pengcheng Huang, Tianming Liu, Zhenghao Liu, Yukun Yan, Shuo Wang, Tong Xiao, Zulong Chen, Maosong Sun • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy 81.5 | 1398 |
| Reasoning | BBH | -- | 726 |
| Code Generation | MBPP | Accuracy 46.2 | 165 |
| Planning | Sudoku | Accuracy 83.6 | 129 |
| Planning | Countdown | Accuracy 42.4 | 89 |
| Mathematical Reasoning | MATH500 | Accuracy 46.8 | 57 |
| Scientific Reasoning | GPQA | Accuracy 28.8 | 55 |
| Reasoning | BBH | Score 54.03 | 36 |
| Multi-task Knowledge | MMLU-Pro | MMLU-Pro Score 33.9 | 33 |
| Truthfulness | TruthfulQA | Truthfulness Score 41.78 | 16 |
Showing 10 of 19 rows
