One Token Is Enough: Improving Diffusion Language Models with a Sink Token

About

Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance. Despite these advantages, there is a critical instability in DLMs: the moving sink phenomenon. Our analysis indicates that sink tokens exhibit low-norm representations in the Transformer's value space, and that the moving sink phenomenon serves as a protective mechanism in DLMs to prevent excessive information mixing. However, their unpredictable positions across diffusion steps undermine inference robustness. To resolve this, we propose a simple but effective extra sink token implemented via a modified attention mask. Specifically, we introduce a special token constrained to attend solely to itself, while remaining globally visible to all other tokens. Experimental results demonstrate that introducing a single extra token stabilizes attention sinks, substantially improving model performance. Crucially, further analysis confirms that the effectiveness of this token is independent of its position and characterized by negligible semantic content, validating its role as a robust and dedicated structural sink.

Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Yao Hu, Shaosheng Cao• 2026

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	HellaSwag	Accuracy61.31	1896
Question Answering	ARC Challenge	Accuracy22.35	906
Commonsense Reasoning	PIQA	Accuracy68.17	757
Physical Commonsense Reasoning	PIQA	Accuracy62.24	696
Question Answering	ARC-E	Accuracy68.18	523
Language Modeling	LAMBADA	Accuracy66.41	412
Question Answering	ARC Easy	Normalized Acc40.91	391
Mathematical Reasoning	GSM8K	Accuracy (GSM8K)58.45	358
Question Answering	ARC-C	Accuracy43.43	258
Commonsense Reasoning	SIQA	Accuracy39.97	168

Showing 10 of 12 rows

Other info

Follow for update

@wizwand_team Discord