AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning
About
Evaluating LLM agent trajectories is fundamentally task-specific: a code-debugging agent should be judged on Correctness and Error Handling, not on Fluency or Safety. Yet the dominant paradigm, LLM-as-Judge with a fixed rubric, applies the same static dimensions regardless of task, producing systematic mis-evaluation. We present AdaRubric, a framework that (i) uses an LLM to generate task-specific evaluation rubrics from task descriptions, (ii) evaluates agent trajectories step by step with confidence-weighted, per-dimension scoring, and (iii) produces dense reward signals for preference learning. Three composable filtering strategies, including the novel DimensionAwareFilter, which provably prevents dimension-level quality masking, yield high-quality DPO preference pairs. On WebArena, ToolBench, and AgentBench, AdaRubric achieves a Pearson correlation of r = 0.79 with human judgments (+0.15 over the strongest baseline), with strong reliability (Krippendorff's alpha = 0.83). DPO models trained on AdaRubric-generated pairs improve task success by 6.8% to 8.5% over the best baseline. AdaRubric also generalises zero-shot to unseen domains (SWE-bench) and extends to multimodal agents (VisualWebArena, OSWorld) without modification. Our code is available at: github.com/alphadl/AdaRubrics
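To make "dimension-level quality masking" concrete, here is a minimal sketch of how a strong score on one dimension can hide a weak score on another when only the overall mean is thresholded. The abstract does not give the scoring formula or the filter's definition, so everything here is an assumption for illustration: the confidence-weighted mean, the 1-5 scale, the thresholds, and the names `StepScore`, `aggregate_by_dimension`, `passes_mean_filter`, and `passes_dimension_aware_filter` are hypothetical and are not the AdaRubric repository's API.

```python
from dataclasses import dataclass


@dataclass
class StepScore:
    """One judge score for one step on one rubric dimension (hypothetical schema)."""
    dimension: str
    score: float       # assumed 1-5 scale
    confidence: float  # assumed to lie in [0, 1]


def aggregate_by_dimension(step_scores):
    """Confidence-weighted mean score per dimension (assumed aggregation rule)."""
    totals, weights = {}, {}
    for s in step_scores:
        totals[s.dimension] = totals.get(s.dimension, 0.0) + s.score * s.confidence
        weights[s.dimension] = weights.get(s.dimension, 0.0) + s.confidence
    # Dimensions whose total confidence is zero carry no evidence and are skipped.
    return {d: totals[d] / weights[d] for d in totals if weights[d] > 0}


def passes_mean_filter(dim_scores, min_mean):
    """Baseline filter: thresholds only the average across dimensions."""
    return sum(dim_scores.values()) / len(dim_scores) >= min_mean


def passes_dimension_aware_filter(dim_scores, min_per_dimension):
    """Sketch of a dimension-aware filter: every dimension must clear the bar."""
    return all(v >= min_per_dimension for v in dim_scores.values())


# Hypothetical trajectory: excellent Correctness, poor Error Handling.
trajectory = [
    StepScore("Correctness", 5.0, 1.0),
    StepScore("Correctness", 5.0, 0.5),
    StepScore("Error Handling", 1.0, 1.0),
    StepScore("Error Handling", 3.0, 1.0),
]
dim_scores = aggregate_by_dimension(trajectory)
mean_ok = passes_mean_filter(dim_scores, min_mean=3.0)
dim_ok = passes_dimension_aware_filter(dim_scores, min_per_dimension=3.0)
print(dim_scores, mean_ok, dim_ok)
```

In this made-up case the mean-based filter accepts the trajectory while the per-dimension check rejects it, which is the masking effect the abstract refers to. The paper's actual filter and its guarantee may be defined differently.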
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Web Agent Navigation | WebArena | Success Rate | 27.8 | 19 |
| Web Agent Task Success | WebArena | Task Success Rate (TSR) | 27.8 | 12 |
| Human Correlation | WebArena | Pearson Correlation Coefficient (r) | 0.79 | 8 |
| Human Correlation | ToolBench | Pearson r | 0.74 | 8 |
| Human Correlation | AgentBench | Pearson r | 0.77 | 8 |
| Success Rate | AgentBench | Success Rate | 34.1 | 8 |
| Task Completion Rate | ToolBench | Task Completion Rate (TCR) | 37.8 | 8 |
| Evaluation Reliability | Web Automation (WA) | Krippendorff's Alpha | 0.85 | 6 |
| Evaluation Reliability | ToolBench (TB) | Krippendorff's Alpha | 0.82 | 6 |
| Multimodal Agent Evaluation | VisualWebArena | Pearson r | 0.76 | 6 |
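
For readers unfamiliar with the human-correlation metric reported above, the following is a minimal sketch of how a Pearson r between automatic scores and human ratings can be computed. The function name `pearson_r` and the example scores are hypothetical placeholders, not data or code from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-trajectory scores: automatic judge vs. human annotator.
judge_scores = [4.5, 2.0, 3.5, 1.0, 5.0]
human_scores = [4.0, 2.5, 3.0, 1.5, 4.5]
r = pearson_r(judge_scores, human_scores)
print(f"Pearson r = {r:.2f}")
```

A value near 1 means the automatic scores rank and scale trajectories much as the human annotators do. A value near 0 means there is little linear agreement.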