CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists
About
Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments. More importantly, CheckEval improves the average agreement across evaluator models by 0.45 and reduces the score variance. CheckEval scores are also more interpretable, because the framework decomposes evaluation criteria into traceable binary decisions, allowing analyses of the specific attributes that drive quality judgments.
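To make the idea of decomposed binary questions concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a checklist score is simply the fraction of questions answered "yes"; the abstract does not specify the aggregation rule. The evaluator names, the answers, and the `checklist_score` helper are all hypothetical.

```python
from statistics import pvariance


def checklist_score(answers):
    """Return the fraction of checklist questions answered 'yes' (True).

    Assumption: a plain average over binary answers; the actual
    CheckEval aggregation may differ.
    """
    if not answers:
        raise ValueError("checklist must contain at least one question")
    return sum(answers) / len(answers)


# Hypothetical yes/no answers from three evaluator models
# to the same four checklist questions about one generated text.
answers_by_evaluator = {
    "evaluator_a": [True, True, False, True],
    "evaluator_b": [True, True, False, True],
    "evaluator_c": [True, False, False, True],
}

# One score per evaluator model.
scores = {
    name: checklist_score(answers)
    for name, answers in answers_by_evaluator.items()
}

# Population variance of the scores across evaluator models;
# a lower value means the evaluators rate more consistently.
score_variance = pvariance(list(scores.values()))

# Per-question disagreement shows which attribute drives the difference.
question_count = len(next(iter(answers_by_evaluator.values())))
disputed_questions = [
    i
    for i in range(question_count)
    if len({answers[i] for answers in answers_by_evaluator.values()}) > 1
]
```

Because each score traces back to individual yes/no decisions, a disagreement between evaluators can be localized to a specific question (here, index 1) instead of remaining an unexplained gap between two Likert ratings.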
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Pairwise Evaluation | BIGGEN | Human Agreement | 68.82 | 41 |
| Pairwise Evaluation | AlpacaEval | Human Agreement | 66.82 | 37 |
| General Utility Evaluation | MT_Bench | Agreement Rate | 75.24 | 33 |
| Pointwise evaluation | BIGGEN | Spearman Corr | 0.424 | 32 |
| Pointwise evaluation | HelpSteer2 | Spearman Correlation | 0.375 | 28 |
| Faithfulness Evaluation | mFACE | Balanced Accuracy (AM) | 60.8 | 7 |
| Faithfulness Evaluation | MEMERAG | Balanced Accuracy (DE) | 76.1 | 7 |
| Multi-party Travel Planning | MR-TravelBench Hard | Group Utility | 7.38 | 5 |
| Multi-party Travel Planning | MR-TravelBench Easy | Group Utility | 5.21 | 5 |
| Multi-party Travel Planning | MR-TravelBench Med | Group Utility | 6.25 | 5 |