Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases

About

This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior evaluation instruction-following capabilities; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) However, LRMs still exhibit strong evaluation biases. To mitigate this bias vulnerability, we propose PlanJudge, a lightweight evaluation strategy that prompts the model to generate an explicit evaluation plan before executing the judgment. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in LLM-as-a-Judge while preserving overall judgment accuracy.
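The plan-before-judgment idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the two-call structure (a plan drafted from the instruction alone, then a judgment that follows it), the `[[A]]`/`[[B]]` verdict format, and the names `plan_then_judge`, `parse_verdict`, and `generate` are all assumptions made for illustration. `generate` stands in for any text-in, text-out model call.

```python
from typing import Callable, Optional

# Hypothetical prompt templates; the paper's actual prompts are not shown here.
PLAN_TEMPLATE = (
    "You will later compare two responses to the instruction below.\n"
    "First, write a short, numbered evaluation plan listing the criteria "
    "you will check. Do not judge anything yet.\n\n"
    "Instruction:\n{instruction}\n"
)

JUDGE_TEMPLATE = (
    "Follow the evaluation plan step by step to compare the two responses.\n"
    "End with your verdict as [[A]] or [[B]].\n\n"
    "Instruction:\n{instruction}\n\n"
    "Evaluation plan:\n{plan}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}\n"
)


def parse_verdict(text: str) -> Optional[str]:
    """Return 'A' or 'B' based on the last verdict tag in the text, else None."""
    last_a = text.rfind("[[A]]")
    last_b = text.rfind("[[B]]")
    if last_a == -1 and last_b == -1:
        return None
    return "A" if last_a > last_b else "B"


def plan_then_judge(
    instruction: str,
    response_a: str,
    response_b: str,
    generate: Callable[[str], str],
) -> dict:
    """Ask the model for an explicit plan, then ask it to judge using that plan."""
    # Step 1: draft the plan from the instruction only (an assumption of this sketch).
    plan = generate(PLAN_TEMPLATE.format(instruction=instruction))
    # Step 2: execute the judgment conditioned on the plan.
    raw = generate(
        JUDGE_TEMPLATE.format(
            instruction=instruction,
            plan=plan,
            response_a=response_a,
            response_b=response_b,
        )
    )
    return {"plan": plan, "verdict": parse_verdict(raw), "raw": raw}
```

Passing the model call in as a function keeps the sketch independent of any particular API client; a stub can be substituted to check the control flow without calling a model.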

Hui Huang, Xuanxin Wu, Muyun Yang, Yuki Arase • 2026

Related benchmarks

Task	Dataset	Result
LLM-as-a-Judge	RewardBench	--	31
LLM-as-a-Judge	JudgeBench	--	29
Instruction Following	Helpsteer2 Trivial	--	8
LLM-as-a-Judge Robustness to Adversarial Attacks	RobustJudge	--	8
Robustness Evaluation	BiasBench	--	8
Robustness Evaluation	LLMBar	--	8
