JudgeLRM: Large Reasoning Models as a Judge
About
Large Language Models (LLMs) are increasingly adopted as evaluators, offering a scalable alternative to human annotation. However, existing supervised fine-tuning (SFT) approaches often fall short in domains that demand complex reasoning. Judgment is inherently reasoning-intensive: beyond surface-level scoring, it requires verifying evidence, identifying errors, and justifying decisions. Through an analysis of evaluation tasks, we find a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, revealing the limits of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards to activate reasoning capabilities. JudgeLRM models consistently outperform SFT-tuned baselines of the same size, as well as other RL and SFT variants, and even surpass state-of-the-art reasoning models: notably, JudgeLRM-3B/4B exceed GPT-4, while JudgeLRM-7B/8B/14B outperform DeepSeek-R1 by over 2% in F1 score, with particularly strong gains on reasoning-heavy tasks. Our findings underscore the value of RL in unlocking reasoning-aligned LLM judges.
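To make "judge-wise, outcome-driven rewards" concrete, here is a minimal sketch of how such a reward might be computed for a pairwise judge. The output format (`[[A: x, B: y]]` score tags), the format penalty, and the function itself are illustrative assumptions, not the paper's actual reward definition: the idea shown is only that the reward depends on whether the judge's final verdict matches a gold preference label, rather than on imitating reference text.

```python
import re

def outcome_reward(judge_output: str, gold_preference: str) -> float:
    """Hypothetical outcome-driven reward for a pairwise LLM judge.

    Assumes the judge ends its response with scores like "[[A: 8, B: 5]]".
    Reward is 1.0 when the implied preference matches the gold label
    ("A" or "B"), 0.0 when it does not, and a small penalty (-0.1,
    an assumed value) when the output cannot be parsed at all, to
    discourage malformed responses during RL training.
    """
    match = re.search(r"\[\[A:\s*(\d+)\s*,\s*B:\s*(\d+)\]\]", judge_output)
    if match is None:
        return -0.1  # format penalty (assumption)
    score_a, score_b = int(match.group(1)), int(match.group(2))
    predicted = "A" if score_a > score_b else "B" if score_b > score_a else "tie"
    return 1.0 if predicted == gold_preference else 0.0
```

Because the signal is attached to the final judgment outcome rather than to any reference rationale, the policy is free to discover its own reasoning chains, which is the behavior the abstract credits for the gains on reasoning-heavy tasks.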
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reward Modeling | RewardBench | Avg Score | 75.2 | 118 |
| Reward Modeling | RewardBench Focus 2 | Accuracy | 29.1 | 82 |
| Reward Modeling | RewardBench Precise IF 2 | Accuracy | 9.4 | 70 |
| Reward Modeling | RM-Bench | Average Score | 64.7 | 53 |
| Reward Modeling | JudgeBench (test) | Overall | 54.6 | 40 |
| Reward Modeling | HelpSteer 3 | Accuracy | 60.2 | 39 |
| Reward Modeling | RM-Bench (test) | Overall Score | 58.7 | 39 |
| Reward Modeling | RM-Bench Chat Hard | Accuracy | 56.1 | 34 |
| Reward Modeling | PPE Correctness (test) | PPE Corr | 42.6 | 26 |
| Reward Modeling | RewardBench (test) | RWBench | 0.752 | 25 |