
JudgeLRM: Large Reasoning Models as a Judge

About

Large Language Models (LLMs) are increasingly adopted as evaluators, offering a scalable alternative to human annotation. However, existing supervised fine-tuning (SFT) approaches often fall short in domains that demand complex reasoning. Judgment is inherently reasoning-intensive: beyond surface-level scoring, it requires verifying evidence, identifying errors, and justifying decisions. Through an analysis of evaluation tasks, we find a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, revealing the limits of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained with reinforcement learning (RL) using judge-wise, outcome-driven rewards to activate reasoning capabilities. JudgeLRM models consistently outperform SFT-tuned baselines of the same size, as well as other RL and SFT variants, and even surpass state-of-the-art reasoning models: notably, JudgeLRM-3B/4B exceeds GPT-4, while JudgeLRM-7B/8B/14B outperforms DeepSeek-R1 by over 2% in F1 score, with particularly strong gains on reasoning-heavy tasks. Our findings underscore the value of RL in unlocking reasoning-aligned LLM judges.
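To make "judge-wise, outcome-driven rewards" concrete, below is a minimal sketch of one plausible formulation: the judge emits a score for each of two candidate responses, and the reward depends only on whether the implied preference matches the ground-truth label. This is an illustration, not the paper's actual reward; the output format (`Score A: x, Score B: y`), the parsing helper, and the 0/1 reward values are all assumptions.

```python
import re

def outcome_reward(judge_output: str, ground_truth_pref: int) -> float:
    """Hypothetical judge-wise, outcome-driven reward.

    Parses two scores from the judge's text output (assumed format:
    "Score A: x ... Score B: y") and returns 1.0 if the implied
    preference agrees with the ground-truth label (0 = A is better,
    1 = B is better), else 0.0. Malformed outputs and ties earn no
    reward, so the policy is pushed toward decisive, parseable
    judgments grounded in the correct outcome.
    """
    m = re.search(
        r"Score A:\s*(\d+(?:\.\d+)?).*?Score B:\s*(\d+(?:\.\d+)?)",
        judge_output,
        flags=re.S,
    )
    if m is None:
        return 0.0  # unparseable judgment: no reward
    score_a, score_b = float(m.group(1)), float(m.group(2))
    if score_a == score_b:
        return 0.0  # tie carries no preference signal
    predicted_pref = 0 if score_a > score_b else 1
    return 1.0 if predicted_pref == ground_truth_pref else 0.0
```

Because the reward depends only on the final verdict, not on the reasoning text, the RL objective leaves the model free to discover whatever chain of reasoning best supports correct judgments.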

Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reward Modeling | RewardBench | Avg Score | 75.2 | 118 |
| Reward Modeling | RewardBench Focus 2 | Accuracy | 29.1 | 82 |
| Reward Modeling | RewardBench Precise IF 2 | Accuracy | 9.4 | 70 |
| Reward Modeling | RM-Bench | Average Score | 64.7 | 53 |
| Reward Modeling | JudgeBench (test) | Overall | 54.6 | 40 |
| Reward Modeling | HelpSteer 3 | Accuracy | 60.2 | 39 |
| Reward Modeling | RM-Bench (test) | Overall Score | 58.7 | 39 |
| Reward Modeling | RM-Bench Chat Hard | Accuracy | 56.1 | 34 |
| Reward Modeling | PPE Correctness (test) | PPE Corr | 42.6 | 26 |
| Reward Modeling | RewardBench (test) | RWBench | 0.752 | 25 |
Showing 10 of 16 rows
