
Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases

About

This paper presents the first systematic comparison of Large Reasoning Models (LRMs) and non-reasoning LLMs as judges. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs follow evaluation instructions more reliably; 3) LRMs are more robust to adversarial attacks targeting judgment tasks; 4) LRMs nevertheless still exhibit strong evaluation biases. To mitigate these biases, we propose PlanJudge, a lightweight evaluation strategy that prompts the model to generate an explicit evaluation plan before executing the judgment. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in LLM-as-a-Judge while preserving overall judgment accuracy.
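
The plan-before-judging idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt wording, the pairwise A/B setup, the choice to write the plan from the instruction alone, and the names `plan_then_judge`, `parse_verdict`, and `generate` are all assumptions made for illustration. `generate` stands in for any function that sends a prompt to a model and returns its text.

```python
from typing import Callable, Optional

# Hypothetical prompt templates; the paper's actual prompts are not shown here.
PLAN_TEMPLATE = (
    "You will later judge two responses to the instruction below.\n"
    "Write a short evaluation plan listing the criteria to check, in order.\n\n"
    "Instruction:\n{instruction}\n"
)

JUDGE_TEMPLATE = (
    "Follow this evaluation plan step by step, then give a final verdict.\n\n"
    "Plan:\n{plan}\n\n"
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}\n\n"
    "End with a line of the form 'Verdict: A' or 'Verdict: B'.\n"
)


def parse_verdict(text: str) -> Optional[str]:
    """Return 'A' or 'B' from the last 'Verdict:' line, or None if absent."""
    for line in reversed(text.splitlines()):
        if line.strip().lower().startswith("verdict:"):
            label = line.split(":", 1)[1].strip().upper()
            if label in ("A", "B"):
                return label
    return None


def plan_then_judge(
    instruction: str,
    response_a: str,
    response_b: str,
    generate: Callable[[str], str],
) -> dict:
    """Two-stage judging: first ask for a plan, then judge using that plan."""
    # Stage 1: the plan is written from the instruction alone (an assumption),
    # so it cannot be shaped by surface features of either response.
    plan = generate(PLAN_TEMPLATE.format(instruction=instruction)).strip()

    # Stage 2: the judgment is conditioned on the explicit plan.
    raw = generate(
        JUDGE_TEMPLATE.format(
            plan=plan,
            instruction=instruction,
            response_a=response_a,
            response_b=response_b,
        )
    ).strip()
    return {"plan": plan, "verdict": parse_verdict(raw), "raw": raw}


# Usage with a stub in place of a real model call.
def fake_model(prompt: str) -> str:
    if "Write a short evaluation plan" in prompt:
        return "1. Check correctness.\n2. Check completeness."
    return "Response A gives the correct sum.\nVerdict: A"


result = plan_then_judge("What is 2 + 2?", "4", "5", fake_model)
print(result["verdict"])
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; swapping `fake_model` for a real client call is the only change needed to try it against an actual judge model.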

Hui Huang, Xuanxin Wu, Muyun Yang, Yuki Arase • 2026

Related benchmarks

Task                                              Dataset             Result  Rank
LLM-as-a-Judge                                    RewardBench         --      31
LLM-as-a-Judge                                    JudgeBench          --      29
Instruction Following                             Helpsteer2 Trivial  --      8
LLM-as-a-Judge Robustness to Adversarial Attacks  RobustJudge         --      8
Robustness Evaluation                             BiasBench           --      8
Robustness Evaluation                             LLMBar              --      8
