JudgeLRM: Large Reasoning Models as a Judge

About

Large Language Models (LLMs) are increasingly adopted as evaluators, offering a scalable alternative to human annotation. However, existing supervised fine-tuning (SFT) approaches often fall short in domains that demand complex reasoning. Judgment is inherently reasoning-intensive: beyond surface-level scoring, it requires verifying evidence, identifying errors, and justifying decisions. Through an analysis of evaluation tasks, we find a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, revealing the limits of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained with reinforcement learning (RL) using judge-wise, outcome-driven rewards to activate reasoning capabilities. JudgeLRM models consistently outperform SFT-tuned baselines of the same size, as well as other RL and SFT variants, and even surpass state-of-the-art reasoning models: notably, JudgeLRM-3B/4B exceed GPT-4, while JudgeLRM-7B/8B/14B outperform DeepSeek-R1 by over 2% in F1 score, with particularly strong gains on reasoning-heavy tasks. Our findings underscore the value of RL in unlocking reasoning-aligned LLM judges.

Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He • 2025
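The abstract does not spell out the reward itself; one plausible reading of a "judge-wise, outcome-driven" reward is that the model is rewarded for the correctness of its final judgment (which answer it scores higher) rather than for imitating reference text. Below is a minimal, hypothetical Python sketch under that reading. The output format, the parsing, and the names `parse_scores` and `judge_reward` are illustrative assumptions, not the paper's implementation.

```python
import re

def parse_scores(completion: str) -> tuple[float, float] | None:
    """Extract two numeric scores from a judge completion of the assumed form
    '<think>...</think><answer>Score A: 7, Score B: 4</answer>'."""
    match = re.search(
        r"<answer>.*?(\d+(?:\.\d+)?).*?(\d+(?:\.\d+)?).*?</answer>",
        completion,
        re.DOTALL,
    )
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

def judge_reward(completion: str, preferred: str) -> float:
    """Outcome-driven reward: score the judgment, not the text.

    +1.0 if the parsed scores rank the human-preferred answer higher,
     0.0 on a tie or unparseable output, -1.0 if the ranking is inverted.
    """
    scores = parse_scores(completion)
    if scores is None:
        return 0.0
    score_a, score_b = scores
    if score_a == score_b:
        return 0.0
    predicted = "A" if score_a > score_b else "B"
    return 1.0 if predicted == preferred else -1.0

# Example: the judge prefers answer A, and ground truth agrees.
out = ("<think>A cites evidence; B misstates a fact.</think>"
       "<answer>Score A: 8, Score B: 3</answer>")
print(judge_reward(out, preferred="A"))  # 1.0
```

Because only the outcome is rewarded, the chain-of-thought inside the assumed <think> tags is free to vary, which is what allows RL to elicit reasoning rather than force imitation of annotated rationales.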

Related benchmarks

Task            | Dataset                   | Metric        | Result | Rank
Reward Modeling | RewardBench               | Accuracy      | 75.2   | 166
Reward Modeling | RewardBench               | Chat Score    | 92.9   | 146
Reward Modeling | RM-Bench                  | Accuracy      | 78.5   | 125
Reward Modeling | RMB                       | Accuracy      | 73.1   | 120
Reward Modeling | RewardBench Focus 2       | Accuracy      | 29.1   | 82
Reward Modeling | RewardBench v2            | Accuracy      | 55.6   | 72
Reward Modeling | RewardBench Precise IF 2  | Accuracy      | 9.4    | 70
Reward Modeling | RM-Bench (test)           | Overall Score | 58.7   | 63
Reward Modeling | JudgeBench (test)         | Overall       | 54.6   | 40
Reward Modeling | HelpSteer 3               | Accuracy      | 60.2   | 39
Showing 10 of 21 rows
