Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration

About

Diffusion large language models (dLLMs) have recently attracted significant attention for their ability to enhance diversity, controllability, and parallelism. However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation. In this work, we propose DiSE, a simple yet effective self-evaluation method that quantifies confidence for dLLMs. DiSE computes the probability of regenerating the tokens of the entire generated sequence, given the full context. These token regeneration probabilities enable more efficient and reliable quality assessment, supporting both likelihood estimation and robust uncertainty quantification. Building on DiSE, we further introduce a flexible-length generation framework that adaptively controls the sequence length based on the model's self-assessment of its own output. We analyze and validate the feasibility of DiSE from the perspective of dLLM generalization, and empirically demonstrate that DiSE is positively correlated with both semantic coherence and answer accuracy. Extensive experiments on likelihood evaluation, uncertainty quantification, and flexible-length generation further confirm the effectiveness of DiSE.
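The regeneration idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how per-token probabilities are aggregated, so the mean log-probability used here is an assumption, and `logits_fn`, `regeneration_confidence`, and `toy_logits_fn` are hypothetical names. The sketch assumes a model that, given the full unmasked sequence (prompt plus generated tokens), returns one row of vocabulary logits per position.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def regeneration_confidence(logits_fn, prompt_ids, generated_ids):
    """Mean log-probability of regenerating the generated tokens.

    Assumption: the model sees the full sequence in one forward pass and
    the per-token log-probabilities are averaged over generated positions.
    """
    full = np.concatenate([prompt_ids, generated_ids])
    log_probs = log_softmax(logits_fn(full))  # shape: (len(full), vocab_size)
    positions = np.arange(len(prompt_ids), len(full))
    # Log-probability the model assigns to each token it already produced.
    token_log_probs = log_probs[positions, full[positions]]
    return float(token_log_probs.mean())


def toy_logits_fn(token_ids, vocab_size=8):
    """Hypothetical stand-in for a dLLM: favors the input token at each position."""
    logits = np.zeros((len(token_ids), vocab_size))
    logits[np.arange(len(token_ids)), token_ids] = 2.0
    return logits


prompt = np.array([1, 2, 3])
answer = np.array([4, 5])
score = regeneration_confidence(toy_logits_fn, prompt, answer)
```

A higher (less negative) score means the model is more willing to reproduce its own output. Under the assumptions above, that value is what would be read as a confidence signal.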

Linhao Zhong, Linyu Wu, Wen Wang, Yuling Xi, Chenchen Jing, Jiaheng Zhang, Hao Chen, Chunhua Shen • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH 500	Accuracy 37.4	442
Mathematical Reasoning	Countdown	Accuracy 27.73	252
Uncertainty Estimation	GSM8K	--	41
Uncertainty Estimation	SVAMP	ROC-AUC (128) 68.8	8
Uncertainty Quantification	Countdown	ROC-AUC (128) 0.61	8
Uncertainty Quantification	Countdown, GSM8K, MATH500, SVAMP Combined	Average ROC-AUC 63.7	8
Uncertainty Quantification	MATH500	ROC-AUC (Threshold 128) 61.1	8
Conditional Likelihood Estimation	ARC Challenge	Accuracy 56.7	7
Conditional Likelihood Estimation	GPQA	Accuracy 30.1	7
Mathematical Reasoning	Countdown, GSM8K, MATH500, and SVAMP	Accuracy 54.92	6
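
The uncertainty rows above report ROC-AUC, which measures how well a confidence score separates correct answers from incorrect ones. The sketch below shows the standard pairwise-ranking definition of that metric. The scores and labels are made-up illustration data, not results from the paper, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the fraction of (correct, incorrect) pairs ranked correctly.

    `labels` holds 1 for a correct answer and 0 for an incorrect one;
    tied scores count as half a correctly ranked pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative confidence scores (e.g. mean log-probabilities) and correctness labels.
confidences = [-0.2, -0.9, -0.4, -1.5]
correct = [1, 0, 1, 0]
auc = roc_auc(confidences, correct)
```

A value of 0.5 corresponds to a confidence score that carries no information about correctness, and 1.0 to perfect separation. Results are often reported on a 0 to 100 scale instead.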
