
JudgeLM: Fine-tuned Large Language Models are Scalable Judges

About

Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose fine-tuning LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively on open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales, from 7B and 13B to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning LLMs as judges and identify them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques, including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM obtains state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. JudgeLM is also efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples on 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, exceeding 90% and even surpassing human-to-human agreement. JudgeLM further demonstrates extended capabilities as a judge for single answers, multimodal models, multiple answers, multi-turn chat, and more. Code is available at https://github.com/baaivision/JudgeLM.
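To illustrate the swap-augmentation idea mentioned above, here is a minimal sketch (not the official JudgeLM code) of how position bias can be countered during fine-tuning: each training sample is duplicated with the two candidate answers swapped and the judge scores swapped to match, so the judge cannot rely on answer order. The field names ("question", "answer1", "score1", "reference", etc.) are assumptions for illustration only.

```python
# Minimal sketch of swap augmentation for mitigating position bias.
# Assumed sample schema (hypothetical): question, answer1, answer2, score1, score2, reference.

def swap_augment(samples):
    """Return the original samples plus order-swapped copies."""
    augmented = []
    for s in samples:
        augmented.append(s)
        augmented.append({
            "question": s["question"],
            "answer1": s["answer2"],          # swap the two candidate answers
            "answer2": s["answer1"],
            "score1": s["score2"],            # swap the judge scores to match
            "score2": s["score1"],
            "reference": s.get("reference"),  # optional reference answer (reference support)
        })
    return augmented


if __name__ == "__main__":
    demo = [{
        "question": "Explain overfitting in one sentence.",
        "answer1": "A model memorizes training data and generalizes poorly.",
        "answer2": "Overfitting means the model is too small.",
        "score1": 9,
        "score2": 2,
        "reference": None,
    }]
    for sample in swap_augment(demo):
        print(sample["score1"], sample["score2"])
```

Reference drop would analogously randomly omit the "reference" field during fine-tuning so the judge works both with and without a reference answer; the exact sampling scheme here is an assumption.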

Lianghui Zhu, Xinggang Wang, Xinlong Wang • 2023

Related benchmarks

Task                         Dataset                             Metric        Result    Rank
Commonsense Reasoning        CSQA                                Accuracy      59.84     366
Pointwise Grading            AlignBench                          Pearson (r)   0.984     38
Pairwise Comparison          AlignBench                          Agreement     42.5      18
LLM-as-a-Judge               FairJudge Benchmark 1K (test)       Agreement     69.56     13
LLM-as-a-Judge               JudgeLM (test)                      Agreement     78        13
LLM-as-a-Judge               PandaLM Human Annotations (test)    Agreement     0.6677    13
LLM Evaluation               PandaLM                             Accuracy      66.97     12
Reward Modeling Evaluation   Reward-Bench                        Agreement     46.57     12
Pairwise Comparison          LLMEval                             Agreement     0.4477    10
Pairwise Comparison          AUTO-J Eval-P                       Agreement     35.13     10
