LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation
About
Validating evaluation metrics for natural language generation (NLG) typically relies on expensive and time-consuming human annotations, which exist predominantly for English datasets. We propose *LLM as a Meta-Judge*, a scalable framework that uses LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate the approach with *meta-correlation*: the alignment between metric rankings derived from synthetic data and those derived from standard human benchmarks. Experiments on Machine Translation, Question Answering, and Summarization show that synthetic validation is a reliable proxy for human judgment, achieving meta-correlations above 0.9 in multilingual QA, and that it is a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will be made publicly available upon paper acceptance.
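The meta-correlation idea can be sketched in a few lines: rank the candidate metrics by how well each correlates with human judgments, rank them again by how well each correlates with the synthetic benchmark, and measure the agreement of the two rankings with Kendall's tau. The metric names and scores below are purely illustrative, not results from the paper, and `kendall_tau` is a minimal helper (no tie handling), not the paper's implementation.

```python
def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length score lists (no tie handling)."""
    assert len(a) == len(b)
    concordant = discordant = 0
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical metric-level scores: how well each metric agrees with
# (a) human judgments and (b) a synthetic degradation benchmark.
metrics = ["BLEU", "chrF", "COMET", "BERTScore"]
human_corr = [0.45, 0.52, 0.71, 0.63]      # metric vs. human judgments
synthetic_corr = [0.41, 0.61, 0.74, 0.58]  # metric vs. synthetic benchmark

# Meta-correlation: agreement between the two induced metric rankings.
meta_corr = kendall_tau(human_corr, synthetic_corr)
print(round(meta_corr, 3))  # prints 0.667
```

A meta-correlation near 1 means the synthetic benchmark would select the same best metrics as human annotation would, which is the property the framework is validated on.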
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Machine Translation Evaluation | WMT 21 | Score (en-ha) | 54.3 | 6 |
| Machine Translation Evaluation | WMT 24 | Quality (cs-uk) | 0.945 | 6 |
| NLG Meta-evaluation | CUS-QA en (cs) | Kendall Correlation | 0.73 | 6 |
| NLG Meta-evaluation | CUS-QA en (sk) | Kendall Correlation | 0.661 | 6 |
| NLG Meta-evaluation | CUS-QA en (uk) | Kendall Correlation | 0.577 | 6 |
| NLG Meta-evaluation | CUS-QA orig. (cs) | Kendall Correlation | 0.804 | 6 |
| NLG Meta-evaluation | CUS-QA orig. (sk) | Kendall Correlation | 0.788 | 6 |
| NLG Meta-evaluation | CUS-QA orig. (uk) | Kendall Correlation | 0.681 | 6 |
| NLG Meta-evaluation | MOCHA | Kendall Correlation | 0.72 | 6 |
| NLG Meta-evaluation | RoSE | Kendall Correlation | 0.635 | 6 |