DiffScore: Text Evaluation Beyond Autoregressive Likelihood

About

Autoregressive language models are widely used for text evaluation. However, their left-to-right factorization introduces positional bias: early tokens are scored with only leftward context, which conflates architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, in which every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi-timestep quality profiles that decompose scores across masking rates, and a bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings. The code is released at: https://github.com/wenlai-lavine/DiffScore.
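
The idea of scoring text by its recoverability at several masking rates can be illustrated with a minimal sketch. This is not the released DiffScore implementation: the `token_prob` callable stands in for a masked diffusion language model and is a hypothetical interface, and the discrete set of masking rates, the number of samples, and the unweighted mean used for aggregation are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, Dict, Sequence

MASK = "<mask>"  # placeholder mask symbol (assumption, not the model's real token)

# Hypothetical model interface: given the corrupted sequence, a masked position,
# and the original token at that position, return P(original token | context).
TokenProb = Callable[[Sequence[str], int, str], float]


def masked_recoverability(
    tokens: Sequence[str],
    token_prob: TokenProb,
    mask_rates: Sequence[float] = (0.25, 0.5, 0.75),
    samples: int = 8,
    seed: int = 0,
) -> Dict[float, float]:
    """Return the mean log-probability of recovering masked tokens, per masking rate."""
    rng = random.Random(seed)
    profile: Dict[float, float] = {}
    for rate in mask_rates:
        total, count = 0.0, 0
        for _ in range(samples):
            # Mask a random subset of positions; at least one token is masked.
            k = max(1, round(rate * len(tokens)))
            masked = set(rng.sample(range(len(tokens)), k))
            corrupted = [MASK if i in masked else t for i, t in enumerate(tokens)]
            # Each masked token is scored with the visible context on both sides.
            for i in masked:
                total += math.log(token_prob(corrupted, i, tokens[i]))
                count += 1
        profile[rate] = total / count
    return profile


def aggregate_score(profile: Dict[float, float]) -> float:
    """Unweighted mean over masking rates (an assumed aggregation rule)."""
    return sum(profile.values()) / len(profile)


def toy_prob(corrupted: Sequence[str], i: int, target: str) -> float:
    """Toy stand-in model: more visible neighbours give a higher probability."""
    neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(corrupted)]
    visible = sum(corrupted[j] != MASK for j in neighbours)
    return 0.2 + 0.3 * visible


if __name__ == "__main__":
    text = "the cat sat on the mat".split()
    profile = masked_recoverability(text, toy_prob)
    print(profile, aggregate_score(profile))
```

The per-rate dictionary plays the role of a quality profile in this sketch: low masking rates leave most local context visible, while high masking rates force recovery from sparse, more global context.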

Wen Lai, Yingli Shen, Dingnan Jin, Qing Cui, Jun Zhou, Maosong Sun, Alexander Fraser • 2026

Related benchmarks

Task	Dataset	Metric	Result
Data-to-text evaluation	SFRES	Spearman Correlation	0.256	39
Machine Translation Evaluation	WMT 2019 (test)	de-en	0.327	25
Data-to-text evaluation	SFHOT	Spearman Correlation (Naturalness)	0.309	25
Text Summarization	REALSumm system-level	Coverage	49.2	15
Text Summarization	QAGS-C	Pearson Correlation Coefficient	0.73	15
Text Summarization	QAGS-X	Pearson Correlation	0.248	15
Data-to-Text	BAGEL	Informativeness (INF)	0.326	15
Text Summarization	Newsroom segment-level	Coherence (COH)	0.683	15
Text Summarization	Rank19	ACC	83.6	15
Text Summarization	SummEval segment-level	Coherence	38.6	15
