JudgeLRM: Large Reasoning Models as a Judge
About
Large Language Models (LLMs) are increasingly adopted as evaluators, offering a scalable alternative to human annotation. However, existing supervised fine-tuning (SFT) approaches often fall short in domains that demand complex reasoning. Judgment is inherently reasoning-intensive: beyond surface-level scoring, it requires verifying evidence, identifying errors, and justifying decisions. Through an analysis of evaluation tasks, we find a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, revealing the limits of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards to activate reasoning capabilities. JudgeLRM models consistently outperform SFT-tuned baselines of the same size, as well as other RL and SFT variants, and even surpass state-of-the-art reasoning models: notably, JudgeLRM-3B/4B exceed GPT-4, while JudgeLRM-7B/8B/14B outperform DeepSeek-R1 by over 2% in F1 score, with particularly strong gains on reasoning-heavy tasks. Our findings underscore the value of RL in unlocking reasoning-aligned LLM judges.
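To make "judge-wise, outcome-driven rewards" concrete, here is a minimal sketch of how such a reward might be computed for a pairwise judge. The output format (`[[A: x, B: y]]` score tags), the format penalty, and the function itself are illustrative assumptions, not the paper's actual reward definition: the idea shown is only that the reward depends on whether the judge's final verdict matches a gold preference label, rather than on imitating reference text.

```python
import re

def outcome_reward(judge_output: str, gold_preference: str) -> float:
    """Hypothetical outcome-driven reward for a pairwise LLM judge.

    Assumes the judge ends its response with scores like "[[A: 8, B: 5]]".
    Reward is 1.0 when the implied preference matches the gold label
    ("A" or "B"), 0.0 when it does not, and a small penalty (-0.1,
    an assumed value) when the output cannot be parsed at all, to
    discourage malformed responses during RL training.
    """
    match = re.search(r"\[\[A:\s*(\d+)\s*,\s*B:\s*(\d+)\]\]", judge_output)
    if match is None:
        return -0.1  # format penalty (assumption)
    score_a, score_b = int(match.group(1)), int(match.group(2))
    predicted = "A" if score_a > score_b else "B" if score_b > score_a else "tie"
    return 1.0 if predicted == gold_preference else 0.0
```

Because the signal is attached to the final judgment outcome rather than to any reference rationale, the policy is free to discover its own reasoning chains, which is the behavior the abstract credits for the gains on reasoning-heavy tasks.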
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reward Modeling | RewardBench | Avg Score | 75.2 | 118 |
| Reward Modeling | RewardBench Focus 2 | Accuracy | 29.1 | 82 |
| Reward Modeling | RewardBench Precise IF 2 | Accuracy | 9.4 | 70 |
| Reward Modeling | RM-Bench | Average Score | 64.7 | 53 |
| Reward Modeling | JudgeBench (test) | Overall | 54.6 | 40 |
| Reward Modeling | HelpSteer 3 | Accuracy | 60.2 | 39 |
| Reward Modeling | RM-Bench (test) | Overall Score | 58.7 | 39 |
| Reward Modeling | RM-Bench Chat Hard | Accuracy | 56.1 | 34 |
| Reward Modeling | PPE Correctness (test) | PPE Corr | 42.6 | 26 |
| Reward Modeling | RewardBench (test) | RWBench | 0.752 | 25 |