LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation
About
Validating evaluation metrics for natural language generation (NLG) typically relies on expensive and time-consuming human annotations, which exist predominantly for English datasets. We propose *LLM as a Meta-Judge*, a scalable framework that uses LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate the approach with *meta-correlation*: the alignment between metric rankings derived from synthetic data and those derived from standard human benchmarks. Experiments on Machine Translation, Question Answering, and Summarization show that synthetic validation is a reliable proxy for human judgment, achieving meta-correlations above 0.9 in multilingual QA, and that it is a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will be made publicly available upon paper acceptance.
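The meta-correlation idea can be sketched in a few lines: rank the candidate metrics by how well each correlates with human judgments, rank them again by how well each correlates with the synthetic benchmark, and measure the agreement of the two rankings with Kendall's tau. The metric names and scores below are purely illustrative, not results from the paper, and `kendall_tau` is a minimal helper (no tie handling), not the paper's implementation.

```python
def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length score lists (no tie handling)."""
    assert len(a) == len(b)
    concordant = discordant = 0
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical metric-level scores: how well each metric agrees with
# (a) human judgments and (b) a synthetic degradation benchmark.
metrics = ["BLEU", "chrF", "COMET", "BERTScore"]
human_corr = [0.45, 0.52, 0.71, 0.63]      # metric vs. human judgments
synthetic_corr = [0.41, 0.61, 0.74, 0.58]  # metric vs. synthetic benchmark

# Meta-correlation: agreement between the two induced metric rankings.
meta_corr = kendall_tau(human_corr, synthetic_corr)
print(round(meta_corr, 3))  # prints 0.667
```

A meta-correlation near 1 means the synthetic benchmark would select the same best metrics as human annotation would, which is the property the framework is validated on.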
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Machine Translation Evaluation | WMT 21 | Score (en-ha) | 54.3 | 6 |
| Machine Translation Evaluation | WMT 24 | Quality (cs-uk) | 0.945 | 6 |
| NLG Meta-evaluation | CUS-QA en (cs) | Kendall Correlation | 0.73 | 6 |
| NLG Meta-evaluation | CUS-QA en (sk) | Kendall Correlation | 0.661 | 6 |
| NLG Meta-evaluation | CUS-QA en (uk) | Kendall Correlation | 0.577 | 6 |
| NLG Meta-evaluation | CUS-QA orig. (cs) | Kendall Correlation | 0.804 | 6 |
| NLG Meta-evaluation | CUS-QA orig. (sk) | Kendall Correlation | 0.788 | 6 |
| NLG Meta-evaluation | CUS-QA orig. (uk) | Kendall Correlation | 0.681 | 6 |
| NLG Meta-evaluation | MOCHA | Kendall Correlation | 0.72 | 6 |
| NLG Meta-evaluation | RoSE | Kendall Correlation | 0.635 | 6 |