
One Token Is Enough: Improving Diffusion Language Models with a Sink Token

About

Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance. Despite these advantages, DLMs exhibit a critical instability: the moving sink phenomenon. Our analysis indicates that sink tokens exhibit low-norm representations in the Transformer's value space, and that the moving sink phenomenon serves as a protective mechanism in DLMs to prevent excessive information mixing. However, the unpredictable positions of these sink tokens across diffusion steps undermine inference robustness. To resolve this, we propose a simple yet effective extra sink token implemented via a modified attention mask. Specifically, we introduce a special token constrained to attend solely to itself, while remaining globally visible to all other tokens. Experimental results demonstrate that introducing a single extra token stabilizes attention sinks, substantially improving model performance. Crucially, further analysis confirms that the effectiveness of this token is independent of its position and characterized by negligible semantic content, validating its role as a robust and dedicated structural sink.
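The masking rule described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name, the use of NumPy, and the choice to place the sink at position 0 are assumptions; only the masking rule itself (sink attends solely to itself, all other tokens attend to everything, including the sink) comes from the abstract.

```python
import numpy as np

def sink_attention_mask(seq_len: int, sink_pos: int = 0) -> np.ndarray:
    """Boolean attention mask (True = query row may attend to key column)
    for a DLM with one dedicated sink token, per the paper's description:
    the sink token attends only to itself, while every other token may
    attend to all positions, including the sink."""
    # DLMs use bidirectional (full) attention, so start fully unmasked.
    mask = np.ones((seq_len, seq_len), dtype=bool)
    # Restrict the sink token's row: it attends solely to itself.
    mask[sink_pos, :] = False
    mask[sink_pos, sink_pos] = True
    # The sink's column stays True, so it remains globally visible.
    return mask

mask = sink_attention_mask(4, sink_pos=0)
# Row 0 (the sink) can only see itself; every other row sees all 4 positions.
```

Because the sink row is cut off from the rest of the sequence, the sink's representation cannot absorb content from other tokens, which is consistent with the paper's finding that the token carries negligible semantic content.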

Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Yao Hu, Shaosheng Cao • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Commonsense Reasoning | HellaSwag | Accuracy | 61.31 | 1460 |
| Question Answering | ARC Challenge | Accuracy | 22.35 | 749 |
| Commonsense Reasoning | PIQA | Accuracy | 68.17 | 647 |
| Question Answering | ARC Easy | Normalized Acc | 40.91 | 385 |
| Mathematical Reasoning | GSM8K | Accuracy (GSM8K) | 58.45 | 358 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 62.24 | 329 |
| Question Answering | ARC-E | Accuracy | 68.18 | 242 |
| Language Modeling | LAMBADA | Accuracy | 66.41 | 183 |
| Question Answering | ARC-C | Accuracy | 43.43 | 166 |
| Reading Comprehension | RACE | Accuracy | 37.61 | 151 |

(10 of 12 rows shown)
