AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning

About

Evaluating LLM agent trajectories is fundamentally task-specific: a code-debugging agent should be judged on Correctness and Error Handling, not on Fluency or Safety. Yet the dominant paradigm, LLM-as-Judge with a fixed rubric, applies the same static dimensions regardless of task, producing systematic mis-evaluation. We present AdaRubric, a framework that (i) adaptively generates task-specific evaluation rubrics from task descriptions via an LLM, (ii) evaluates agent trajectories step-by-step with confidence-weighted, per-dimension scoring, and (iii) produces dense reward signals for preference learning. Three composable filtering strategies, including the novel DimensionAwareFilter that provably prevents dimension-level quality masking, yield high-quality DPO preference pairs. On WebArena, ToolBench, and AgentBench, AdaRubric achieves Pearson r = 0.79 human correlation (+0.15 over the strongest baseline), with strong reliability (Krippendorff's alpha = 0.83). DPO models trained on AdaRubric-generated pairs improve task success by +6.8% to +8.5% over the best baseline. AdaRubric also generalises zero-shot to unseen domains (SWE-bench) and extends to multimodal agents (VisualWebArena, OSWorld) without modification. Our code is available at: github.com/alphadl/AdaRubrics
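
The abstract names confidence-weighted per-dimension scoring and a dimension-aware filter but does not give their formulas. The following is a minimal sketch of how such a pipeline could look, not the AdaRubrics implementation. Everything in it is an assumption made for illustration: the 1-5 score scale, confidences in [0, 1], the per-dimension floor, the mean-score margin, and all class and function names.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class StepScore:
    """One judge score for one rubric dimension at one trajectory step."""
    dimension: str
    score: float       # assumed scale: 1 (poor) to 5 (excellent)
    confidence: float  # assumed judge confidence in [0, 1]


def dimension_scores(steps: list[StepScore]) -> dict[str, float]:
    """Confidence-weighted mean score per rubric dimension."""
    totals: dict[str, float] = {}
    weights: dict[str, float] = {}
    for s in steps:
        totals[s.dimension] = totals.get(s.dimension, 0.0) + s.confidence * s.score
        weights[s.dimension] = weights.get(s.dimension, 0.0) + s.confidence
    # Dimensions whose total confidence is zero carry no information; skip them.
    return {d: totals[d] / weights[d] for d in totals if weights[d] > 0}


def mean_score(scores: dict[str, float]) -> float:
    """Unweighted mean across dimensions (the aggregate that can mask a weak one)."""
    return sum(scores.values()) / len(scores)


def passes_dimension_filter(scores: dict[str, float], floor: float = 3.0) -> bool:
    """Hypothetical dimension-aware check: every dimension must clear the floor,
    so a high score on one dimension cannot hide a low score on another."""
    return bool(scores) and all(v >= floor for v in scores.values())


def preference_pairs(trajs: dict[str, dict[str, float]],
                     floor: float = 3.0,
                     margin: float = 0.5) -> list[tuple[str, str]]:
    """Return (chosen, rejected) id pairs: the chosen trajectory must pass the
    dimension filter and beat the rejected one's mean score by at least margin."""
    pairs = []
    for chosen, rejected in permutations(trajs, 2):
        if not passes_dimension_filter(trajs[chosen], floor):
            continue
        if mean_score(trajs[chosen]) - mean_score(trajs[rejected]) >= margin:
            pairs.append((chosen, rejected))
    return pairs


# Toy data for a code-debugging task with two rubric dimensions.
trajectories = {
    "A": dimension_scores([
        StepScore("Correctness", 4, 0.9),
        StepScore("Correctness", 5, 0.6),
        StepScore("Error Handling", 4, 1.0),
    ]),
    "B": dimension_scores([
        StepScore("Correctness", 5, 1.0),
        StepScore("Error Handling", 1, 1.0),
    ]),
    "C": dimension_scores([
        StepScore("Correctness", 3, 0.5),
        StepScore("Error Handling", 3, 0.5),
    ]),
}

print(preference_pairs(trajectories))  # [('A', 'B'), ('A', 'C')]
```

In this toy data, trajectory B has a mean of 3.0 and would pass a threshold on the aggregate score, yet its Error Handling score of 1 fails the per-dimension floor, so it is never selected as the chosen side of a pair. That is the masking effect the abstract attributes to aggregate-only filtering.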

Liang Ding • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Web Agent Navigation | WebArena | Success Rate | 27.8 | 19 |
| Web Agent Task Success | WebArena | Task Success Rate (TSR) | 27.8 | 12 |
| Human Correlation | WebArena | Pearson Correlation Coefficient (r) | 0.79 | 8 |
| Human Correlation | ToolBench | Pearson r | 0.74 | 8 |
| Human Correlation | AgentBench | Pearson r | 0.77 | 8 |
| Success Rate | AgentBench | Success Rate | 34.1 | 8 |
| Task Completion Rate | ToolBench | Task Completion Rate (TCR) | 37.8 | 8 |
| Evaluation Reliability | Web Automation (WA) | Krippendorff's Alpha | 0.85 | 6 |
| Evaluation Reliability | ToolBench (TB) | Krippendorff's Alpha | 0.82 | 6 |
| Multimodal Agent Evaluation | VisualWebArena | Pearson r | 0.76 | 6 |
Showing 10 of 12 rows
