LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

About

Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose LLM as a Meta-Judge, a scalable framework that uses LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach with meta-correlation, which measures the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA and proving to be a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will become publicly available upon paper acceptance.
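As an illustration of the meta-correlation described above, the sketch below ranks a set of metrics once by their correlation with human judgments and once by their correlation with labels from synthetically degraded data, then measures the agreement between the two rankings with Kendall's tau. The metric names and scores are illustrative placeholders, not the paper's results or released code.

```python
# Minimal sketch of meta-correlation: agreement between the metric ranking
# obtained on human-annotated data and the ranking obtained on synthetic
# (degraded) data. All numbers below are made-up placeholders.
from scipy.stats import kendalltau

# Per-metric correlation with human judgments (standard meta-evaluation).
human_corr = {"BLEU": 0.41, "chrF": 0.48, "COMET": 0.62, "BERTScore": 0.55}

# Per-metric correlation with synthetic degradation labels.
synthetic_corr = {"BLEU": 0.39, "chrF": 0.47, "COMET": 0.65, "BERTScore": 0.52}

metrics = sorted(human_corr)  # fixed metric order for both score lists
human_scores = [human_corr[m] for m in metrics]
synthetic_scores = [synthetic_corr[m] for m in metrics]

# Meta-correlation: Kendall's tau between the two metric rankings.
tau, p_value = kendalltau(human_scores, synthetic_scores)
print(f"meta-correlation (Kendall tau) = {tau:.3f}, p = {p_value:.3f}")
```

A high tau here would mean that ranking metrics on the synthetic dataset recovers roughly the same ordering as ranking them against human judgments, which is the property the framework is designed to test.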

Lukáš Eigler, Jindřich Libovický, David Hurych • 2026

Related benchmarks

Task                           | Dataset            | Metric              | Result | Rank
Machine Translation Evaluation | WMT 21             | Score (en-ha)       | 54.3   | 6
Machine Translation Evaluation | WMT 24             | Quality (cs-uk)     | 0.945  | 6
NLG Meta-evaluation            | CUS-QA en (cs)     | Kendall Correlation | 0.73   | 6
NLG Meta-evaluation            | CUS-QA en (sk)     | Kendall Correlation | 0.661  | 6
NLG Meta-evaluation            | CUS-QA en (uk)     | Kendall Correlation | 0.577  | 6
NLG Meta-evaluation            | CUS-QA orig. (cs)  | Kendall Correlation | 0.804  | 6
NLG Meta-evaluation            | CUS-QA orig. (sk)  | Kendall Correlation | 0.788  | 6
NLG Meta-evaluation            | CUS-QA orig. (uk)  | Kendall Correlation | 0.681  | 6
NLG Meta-evaluation            | MOCHA              | Kendall Correlation | 0.72   | 6
NLG Meta-evaluation            | RoSE               | Kendall Correlation | 0.635  | 6

Showing 10 of 19 rows.
