Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases
About
This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are better judges than non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs follow evaluation instructions more reliably; 3) LRMs are more robust to adversarial attacks targeting judgment tasks; 4) nevertheless, LRMs still exhibit strong evaluation biases. To mitigate this bias vulnerability, we propose PlanJudge, a lightweight evaluation strategy that prompts the model to generate an explicit evaluation plan before executing the judgment. Despite its simplicity, PlanJudge significantly mitigates biases in LLM-as-a-Judge in our experiments while preserving overall judgment accuracy.
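The plan-then-judge idea can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the two-call structure, the choice to condition the plan on the instruction alone, and the names `PLAN_TEMPLATE`, `JUDGE_TEMPLATE`, `parse_verdict`, and `plan_then_judge` are all assumptions made for illustration. The model is passed in as a plain `generate` callable so the sketch does not depend on any particular LLM API.

```python
from typing import Callable, Dict, Optional

# Hypothetical prompt templates; the paper's actual prompts are not shown here.
PLAN_TEMPLATE = (
    "You will later compare two responses to the instruction below.\n"
    "Before judging, write a short evaluation plan listing the criteria "
    "you will check and the order in which you will check them.\n\n"
    "Instruction:\n{instruction}\n"
)

JUDGE_TEMPLATE = (
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}\n\n"
    "Evaluation plan:\n{plan}\n\n"
    "Follow the plan step by step, then end with a final line of the form "
    "'Verdict: A' or 'Verdict: B'."
)


def parse_verdict(text: str) -> Optional[str]:
    """Return 'A' or 'B' from the last 'Verdict:' line, or None if absent."""
    for line in reversed(text.strip().splitlines()):
        line = line.strip()
        if line.lower().startswith("verdict:"):
            choice = line.split(":", 1)[1].strip().upper()
            if choice in ("A", "B"):
                return choice
    return None


def plan_then_judge(
    generate: Callable[[str], str],
    instruction: str,
    response_a: str,
    response_b: str,
) -> Dict[str, Optional[str]]:
    """Ask for an explicit evaluation plan first, then judge using that plan."""
    # Step 1 (assumed): the plan is written before the judgment is made.
    plan = generate(PLAN_TEMPLATE.format(instruction=instruction))
    # Step 2 (assumed): the judgment prompt includes the plan verbatim.
    judgment = generate(
        JUDGE_TEMPLATE.format(
            instruction=instruction,
            response_a=response_a,
            response_b=response_b,
            plan=plan,
        )
    )
    return {"plan": plan, "judgment": judgment, "verdict": parse_verdict(judgment)}
```

Any function that maps a prompt string to a completion string can be supplied as `generate`. Writing the plan before the responses are shown is one possible design choice; the sketch makes no claim about how the paper sequences the two steps.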
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| LLM-as-a-Judge | RewardBench | -- | 31 |
| LLM-as-a-Judge | JudgeBench | -- | 29 |
| Instruction Following | Helpsteer2 Trivial | -- | 8 |
| LLM-as-a-Judge Robustness to Adversarial Attacks | RobustJudge | -- | 8 |
| Robustness Evaluation | BiasBench | -- | 8 |
| Robustness Evaluation | LLMBar | -- | 8 |