AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning

About

Evaluating LLM agent trajectories is fundamentally task-specific: a code-debugging agent should be judged on Correctness and Error Handling, not on Fluency or Safety. Yet the dominant paradigm -- LLM-as-Judge with a fixed rubric -- applies the same static dimensions regardless of task, producing systematic mis-evaluation. We present AdaRubric, a framework that (i) adaptively generates task-specific evaluation rubrics from task descriptions via LLM, (ii) evaluates agent trajectories step-by-step with confidence-weighted, per-dimension scoring, and (iii) produces dense reward signals for preference learning. Three composable filtering strategies, including the novel DimensionAwareFilter that provably prevents dimension-level quality masking, yield high-quality DPO preference pairs. On WebArena, ToolBench, and AgentBench, AdaRubric achieves Pearson r = 0.79 human correlation (+0.15 over the strongest baseline), with strong reliability (Krippendorff's alpha = 0.83). DPO models trained on AdaRubric-generated pairs improve task success by +6.8-8.5% over the best baseline. AdaRubric also generalises zero-shot to unseen domains (SWE-bench) and extends to multimodal agents (VisualWebArena, OSWorld) without modification. Our code is available at: github.com/alphadl/AdaRubrics

Liang Ding• 2026

Related benchmarks

Task	Dataset	Result
Web Agent Navigation	WebArena	Success Rate27.8	19
Web Agent Task Success	WebArena	Task Success Rate (TSR)27.8	12
Human Correlation	WebArena	Pearson Correlation Coefficient (r)0.79	8
Human Correlation	ToolBench	Pearson r0.74	8
Human Correlation	AgentBench	Pearson r0.77	8
Success Rate	AgentBench	Success Rate34.1	8
Task Completion Rate	ToolBench	Task Completion Rate (TCR)37.8	8
Evaluation Reliability	Web Automation (WA)	Krippendorff's Alpha0.85	6
Evaluation Reliability	ToolBench (TB)	Krippendorff's Alpha0.82	6
Multimodal Agent Evaluation	VisualWebArena	Pearson r0.76	6

Showing 10 of 12 rows

Other info

Follow for update

@wizwand_team Discord