
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

About

LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions, and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer, and solely synthetically generated, preference pairs. Additional experiments on other benchmarks such as RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.
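The abstract's three-stage recipe (unconstrained plan → plan execution → verdict) and its self-training preference pairs can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: the `llm` stub, the prompt wording, and all function names are assumptions; a real system would sample from a Thinking-LLM and feed the resulting chosen/rejected CoT pairs to a DPO-style trainer.

```python
def llm(prompt: str) -> str:
    """Stand-in for a Thinking-LLM call; deterministic here for illustration."""
    if "Draft an evaluation plan" in prompt:
        return ("1. Identify the instruction's constraints.\n"
                "2. Check each response against them.\n"
                "3. Compare the responses and decide.")
    if "Execute the plan" in prompt:
        return "Step 1: ... Step 2: ... Step 3: ... Verdict: A"
    return ""


def judge(instruction: str, resp_a: str, resp_b: str):
    """Three stages: unconstrained plan, execution of that plan, final verdict."""
    plan = llm(f"Draft an evaluation plan for judging responses to: {instruction}")
    execution = llm(
        "Execute the plan.\n"
        f"Plan: {plan}\nInstruction: {instruction}\n"
        f"Response A: {resp_a}\nResponse B: {resp_b}"
    )
    verdict = execution.rsplit("Verdict:", 1)[-1].strip()  # 'A' or 'B'
    return plan, execution, verdict


def build_preference_pairs(instruction, resp_a, resp_b, gold, n_samples=4):
    """Sample several (plan, execution) CoTs; traces reaching the gold verdict
    become 'chosen', the rest 'rejected' -- inputs for preference optimization."""
    chosen, rejected = [], []
    for _ in range(n_samples):
        plan, execution, verdict = judge(instruction, resp_a, resp_b)
        (chosen if verdict == gold else rejected).append((plan, execution))
    # Every (chosen, rejected) combination is a training pair.
    return [(c, r) for c in chosen for r in rejected]
```

In the paper's self-training loop, each iteration would regenerate such pairs with the current judge and optimize on them; with the deterministic stub above, all samples agree, so no contrastive pairs are produced.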

Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason Weston, Tianlu Wang • 2025

Related benchmarks

Task                      | Dataset                  | Metric        | Result | Rank
Reward Modeling           | RewardBench              | Accuracy      | 93.9   | 166
Reward Modeling           | RM-Bench                 | Accuracy      | 82.1   | 125
Reward Modeling           | JudgeBench               | Accuracy      | 56.6   | 105
Reward Modeling           | RewardBench v1.0 (test)  | Average Score | 0.939  | 89
Reward Modeling           | RM-Bench (test)          | Overall Score | 82.1   | 63
Pair-wise comparison      | RewardBench              | Accuracy      | 88.7   | 29
LLM-as-a-judge evaluation | JudgeBench (test)        | Score         | 56.6   | 22
Pair-wise comparison      | EvalBias                 | Accuracy      | 74.4   | 16
Pair-wise comparison      | MTBench Human            | Accuracy      | 81.4   | 16
Pair-wise comparison      | HelpSteer2               | Accuracy      | 65.5   | 16

(Showing 10 of 12 rows)
