CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists

About

Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments. More importantly, CheckEval improves the average agreement across evaluator models by 0.45 and reduces the score variance. CheckEval scores are also more interpretable, because the framework decomposes evaluation criteria into traceable binary decisions, which allows analysis of the specific attributes driving quality judgments.

Yukyung Lee, Joonghoon Kim, Jaehee Kim, Hyowon Cho, Jaewook Kang, Pilsung Kang, Najoung Kim • 2024
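
The checklist idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `ChecklistItem` type, the helper names, the example questions, and the aggregation rule (score = fraction of "yes" answers) are all assumptions made for illustration, and the judge replies are hard-coded rather than produced by a real model.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass(frozen=True)
class ChecklistItem:
    aspect: str    # evaluation criterion the question belongs to
    question: str  # binary (yes/no) question posed to the judge model


def parse_answer(raw: str) -> bool:
    """Map a raw judge reply such as ' Yes. ' to a boolean decision."""
    words = raw.strip().lower().split()
    first = words[0].strip(".,!:") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    raise ValueError(f"expected a yes/no answer, got: {raw!r}")


def checklist_score(answers: list[bool]) -> float:
    """Assumed aggregation: the fraction of questions answered 'yes'."""
    if not answers:
        raise ValueError("checklist must contain at least one answer")
    return sum(answers) / len(answers)


def failed_items(items: list[ChecklistItem], answers: list[bool]) -> list[ChecklistItem]:
    """Return the items answered 'no', so a low score can be traced to its causes."""
    return [item for item, ok in zip(items, answers) if not ok]


# Hypothetical checklist; the questions are illustrative only.
checklist = [
    ChecklistItem("coherence", "Does each sentence follow logically from the previous one?"),
    ChecklistItem("coherence", "Is the text free of contradictions?"),
    ChecklistItem("fluency", "Is the text free of grammatical errors?"),
    ChecklistItem("relevance", "Does the text address the main point of the source?"),
]

# Hard-coded replies standing in for two evaluator models.
judge_replies = {
    "judge_a": ["Yes", "Yes", "No", "Yes"],
    "judge_b": ["yes", "yes", "no", "no"],
}

scores = {
    name: checklist_score([parse_answer(reply) for reply in replies])
    for name, replies in judge_replies.items()
}
score_variance = pvariance(list(scores.values()))  # spread of scores across judges

print(scores)
print(score_variance)
```

Because every score is a sum of individual yes/no decisions, a disagreement between two judges can be traced to the specific questions they answered differently, which is the interpretability benefit the abstract refers to.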

Related benchmarks

Task	Dataset	Metric	Value
Pairwise Evaluation	BIGGEN	Human Agreement	68.82	41
Pairwise Evaluation	AlpacaEval	Human Agreement	66.82	37
General Utility Evaluation	MT_Bench	Agreement Rate	75.24	33
Pointwise Evaluation	BIGGEN	Spearman Correlation	0.424	32
Pointwise Evaluation	HelpSteer2	Spearman Correlation	0.375	28
Faithfulness Evaluation	mFACE	Balanced Accuracy (AM)	60.8	7
Faithfulness Evaluation	MEMERAG	Balanced Accuracy (DE)	76.1	7
Multi-party Travel Planning	MR-TravelBench Hard	Group Utility	7.38	5
Multi-party Travel Planning	MR-TravelBench Easy	Group Utility	5.21	5
Multi-party Travel Planning	MR-TravelBench Med	Group Utility	6.25	5

Showing 10 of 11 rows
