
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

About

We present a principled approach to provide LLM-based evaluation with a rigorous guarantee of human agreement. We first propose that a reliable evaluation method should not uncritically rely on model preferences for pairwise evaluation, but rather assess the confidence of judge models and selectively decide when to trust their judgement. We then show that under this selective evaluation framework, human agreement can be provably guaranteed, such that the model evaluation aligns with that of humans to a user-specified agreement level. As part of our framework, we also introduce Simulated Annotators, a novel confidence estimation method that significantly improves judge calibration and thus enables high coverage of evaluated instances. Finally, we propose Cascaded Selective Evaluation, where we use cheaper models as initial judges and escalate to stronger models only when necessary, while still providing a provable guarantee of human agreement. Experimental results show that Cascaded Selective Evaluation guarantees strong alignment with humans, far beyond what LLM judges could achieve without selective evaluation. For example, on a subset of Chatbot Arena where GPT-4 almost never achieves 80% human agreement, our method, even while employing substantially more cost-effective models such as Mistral-7B, guarantees over 80% human agreement with almost 80% test coverage.

Jaehun Jung, Faeze Brahman, Yejin Choi • 2024
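
The trust-or-escalate idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Judge` signature, the `Verdict` type, the toy judges, and the threshold values are all hypothetical. In particular, the thresholds here are hand-picked placeholders, so this sketch carries no guarantee of human agreement; the abstract's guarantee depends on how confidence is estimated and how thresholds are chosen, which is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical judge interface: given a prompt and two responses, return the
# preferred label ("A" or "B") and a confidence estimate in [0, 1].
Judge = Callable[[str, str, str], Tuple[str, float]]


@dataclass
class Verdict:
    preference: Optional[str]   # "A", "B", or None if every judge abstained
    judge_index: Optional[int]  # index of the judge that was trusted, if any
    confidence: float           # confidence of the last judge consulted


def cascaded_selective_eval(
    judges: Sequence[Judge],
    thresholds: Sequence[float],
    prompt: str,
    response_a: str,
    response_b: str,
) -> Verdict:
    """Consult judges from cheapest to strongest; trust the first confident one."""
    if len(judges) != len(thresholds):
        raise ValueError("need exactly one threshold per judge")
    last_confidence = 0.0
    for index, (judge, threshold) in enumerate(zip(judges, thresholds)):
        preference, confidence = judge(prompt, response_a, response_b)
        last_confidence = confidence
        if confidence >= threshold:
            # Trust: this judge is confident enough, so stop here.
            return Verdict(preference, index, confidence)
        # Escalate: fall through to the next (stronger, costlier) judge.
    # No judge was confident enough: abstain on this instance.
    return Verdict(None, None, last_confidence)


# Toy stand-ins for a cheap and a strong judge (assumptions, not real models).
def cheap_judge(prompt: str, a: str, b: str) -> Tuple[str, float]:
    return "A", 0.60


def strong_judge(prompt: str, a: str, b: str) -> Tuple[str, float]:
    return "B", 0.95


escalated = cascaded_selective_eval(
    [cheap_judge, strong_judge], [0.80, 0.90], "Which is better?", "x", "y"
)
abstained = cascaded_selective_eval(
    [cheap_judge], [0.80], "Which is better?", "x", "y"
)
```

In this toy run the cheap judge falls below its threshold, so the instance is escalated and the strong judge's preference is used; with only the cheap judge available, the instance is left unevaluated, which is what reduces coverage below 100%.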

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Question Answering | NQ (test) | -- | 133 |
| Question Answering | TriviaQA | Correlation: 2.43e+3 | 40 |
| Selective Question Answering | TriviaQA | Score (Coverage 14%): 2.47e+3 | 40 |
| Question Answering | SQuAD | Correlation (alpha=0.17): 2.02e+3 | 35 |
| Question Answering | NQ | Correlation Score: 747 | 35 |
| Question Answering | SQuAD v2 | Correlation: 1.84e+3 | 26 |
| LLM Judgement Confidence Estimation | AlpacaEval (test) | Rank Correlation (RK): 0.4321 | 16 |
| LLM Judgement Confidence Estimation | Chatbot Arena (test) | RK: 0.3355 | 16 |
| LLM Judgement Confidence Estimation | TL;DR (test) | RK: 0.4085 | 16 |
| LLM Judgement Confidence Estimation | HH-RLHF (test) | RK: 0.3986 | 16 |
Showing 10 of 18 rows
