Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
About
Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters. Remarkably, TIR-Judge-Zero, trained entirely without distilled judge trajectories, matches the performance of distilled variants, demonstrating that tool-augmented judges can self-evolve through iterative reinforcement learning.
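The core mechanism (a judge that interleaves text reasoning with code execution before emitting a verdict) can be sketched as a simple tool loop. This is a minimal illustration, not the paper's implementation: the `model` callable, the `VERDICT:` marker, and the `[tool output]` delimiter are all hypothetical conventions chosen for the sketch, and the unsandboxed `exec` stands in for a real code executor.

```python
import re
import io
import contextlib

def run_code(snippet: str) -> str:
    """Execute a Python snippet and capture its stdout (no sandboxing here;
    a real deployment would isolate the executor)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(snippet, {})
    except Exception as e:
        return f"Error: {e}"
    return buf.getvalue().strip()

def judge(prompt: str, model) -> str:
    """Iterative tool loop: the judge may emit ```python fences; each one is
    executed and its output appended to the transcript, until the judge
    produces a final verdict."""
    transcript = prompt
    for _ in range(4):  # cap the number of tool-call rounds
        step = model(transcript)
        transcript += step
        match = re.search(r"```python\n(.*?)```", step, re.S)
        if match:  # tool call detected: run the code, feed the result back
            transcript += f"\n[tool output]\n{run_code(match.group(1))}\n"
        if "VERDICT:" in step:  # judgment reached, e.g. pairwise winner A/B
            return step.split("VERDICT:")[1].strip()
    return "no verdict"

# Stub "model": first verifies a length constraint with code, then decides.
def stub_model(transcript: str) -> str:
    if "[tool output]" not in transcript:
        return "Check word count:\n```python\nprint(len('a b c'.split()))\n```"
    return "The constraint holds. VERDICT: A"

print(judge("Which response better satisfies the 3-word constraint?", stub_model))
# prints "A"
```

The same loop supports pointwise, pairwise, or listwise judging by changing only the prompt and the expected verdict format, which is why the flexible-format principle composes naturally with tool use.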
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reward Modeling | JudgeBench (test) | Overall: 70.4 | 40 |
| Reward Modeling | RM-Bench (test) | Overall Score: 76.7 | 39 |
| Reward Modeling | PPE Correctness (test) | PPE Corr: 71 | 26 |
| Reward Modeling | RewardBench (test) | RWBench: 0.814 | 25 |
| Listwise Judging | RewardBench listwise 2 | IF Score: 58.1 | 10 |
| Reward Modeling | Overall Performance (test) | Overall: 73.8 | 9 |
| Reward Modeling | RewardBench 2 (test) | RWBench2 Score: 73.4 | 9 |
| Code Generation | BigCodeBench Full 1.0 | -- | 3 |
| Instruction Following | IFEval 1.0 (full) | -- | 2 |