Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration

About

Diffusion large language models (dLLMs) have recently attracted significant attention for their ability to enhance diversity, controllability, and parallelism. However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation. In this work, we propose DiSE, a simple yet effective self-evaluation method for quantifying confidence in dLLMs. DiSE quantifies confidence by computing the probability of regenerating the tokens of the entire generated sequence, given the full context. By leveraging these token regeneration probabilities, the method enables more efficient and reliable quality assessment and supports both likelihood estimation and robust uncertainty quantification. Building upon DiSE, we further introduce a flexible-length generation framework that adaptively controls the sequence length based on the model's self-assessment of its own output. We analyze and validate the feasibility of DiSE from the perspective of dLLM generalization, and empirically demonstrate that DiSE is positively correlated with both semantic coherence and answer accuracy. Extensive experiments on likelihood evaluation, uncertainty quantification, and flexible-length generation further confirm the effectiveness of DiSE.
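
The regeneration idea described above can be illustrated with a minimal sketch. It assumes the model has already been run once on the full, unmasked sequence and has returned per-position logits; the function name `regeneration_confidence`, the `gen_start` parameter, and the choice to aggregate by mean log-probability are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def regeneration_confidence(logits, tokens, gen_start):
    """Mean log-probability of regenerating tokens[gen_start:].

    logits: (seq_len, vocab) array, assumed to come from one forward pass
            over the full sequence (prompt plus generated tokens).
    tokens: token ids of that same sequence.
    gen_start: index of the first generated token (hypothetical parameter).
    """
    logp = log_softmax(np.asarray(logits, dtype=float))
    tokens = np.asarray(tokens)
    positions = np.arange(gen_start, len(tokens))
    # Log-probability the model assigns to the token already at each position.
    token_logp = logp[positions, tokens[positions]]
    return float(token_logp.mean())

# Toy example: 2 prompt tokens followed by 2 generated tokens, vocabulary of 4.
tokens = [3, 1, 2, 0]
logits = np.zeros((4, 4))
logits[2, 2] = 5.0  # the model strongly prefers token 2 at position 2
logits[3, 0] = 5.0  # and token 0 at position 3
confident = regeneration_confidence(logits, tokens, gen_start=2)
uniform = regeneration_confidence(np.zeros((4, 4)), tokens, gen_start=2)
```

In this toy setup `confident` is close to zero while `uniform` equals log(1/4), so a higher score corresponds to a sequence the model would more readily regenerate.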

Linhao Zhong, Linyu Wu, Wen Wang, Yuling Xi, Chenchen Jing, Jiaheng Zhang, Hao Chen, Chunhua Shen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy | 37.4 | 391 |
| Mathematical Reasoning | Countdown | Accuracy | 27.73 | 126 |
| Uncertainty Estimation | SVAMP | ROC-AUC (128) | 68.8 | 8 |
| Uncertainty Quantification | Countdown | ROC-AUC (128) | 0.61 | 8 |
| Uncertainty Quantification | Countdown, GSM8K, MATH500, SVAMP Combined | Average ROC-AUC | 63.7 | 8 |
| Uncertainty Quantification | MATH500 | ROC-AUC (Threshold 128) | 61.1 | 8 |
| Uncertainty Estimation | GSM8K | ROC-AUC (128) | 0.633 | 8 |
| Conditional Likelihood Estimation | ARC Challenge | Accuracy | 56.7 | 7 |
| Conditional Likelihood Estimation | GPQA | Accuracy | 30.1 | 7 |
| Mathematical Reasoning | Countdown, GSM8K, MATH500, and SVAMP | Accuracy | 54.92 | 6 |
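
For readers unfamiliar with the ROC-AUC metric used in the uncertainty rows, the sketch below shows one standard way to compute it from confidence scores and correctness labels. ROC-AUC lies in [0, 1] and is sometimes reported multiplied by 100. The function name `roc_auc` and the toy data are illustrative assumptions and do not reproduce the paper's evaluation code.

```python
import numpy as np

def roc_auc(scores, labels):
    """Probability that a randomly chosen correct answer receives a higher
    confidence score than a randomly chosen incorrect one (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one correct and one incorrect example")
    # Compare every correct example against every incorrect example.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy data: confidence scores and whether each answer was correct.
confidences = [-0.1, -0.3, -0.9, -1.2]
correct = [1, 0, 1, 0]
auc = roc_auc(confidences, correct)  # 3 of 4 correct/incorrect pairs are ordered correctly
```

A value of 0.5 corresponds to a confidence score that carries no information about correctness, and 1.0 to one that ranks every correct answer above every incorrect one.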