
AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators

About

Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, and dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with the human signal, taking you from expensive human measurements to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We also show that AutoMetrics can serve as a proxy reward that matches the effect of a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
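
The regression step at the heart of this pipeline is straightforward to sketch. The snippet below is a minimal illustration under assumptions, not the released toolkit: it supposes a matrix of per-output scores from candidate metrics (retrieved MetricBank entries plus generated LLM-judge criteria) and a small vector of human ratings, fits a ridge regression to weight the candidates, and reports Kendall's tau between the composed metric and the human signal. All variable names, the toy data, and the choice of ridge regularization are illustrative assumptions.

```python
# Minimal sketch of regression-based metric composition (hypothetical names;
# not the authors' implementation).
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy data: 80 system outputs, each scored by 6 candidate metrics
# (stand-ins for MetricBank retrievals + generated LLM-judge criteria).
candidate_scores = rng.normal(size=(80, 6))

# Synthetic "human signal": correlated with a sparse subset of the metrics,
# mimicking fewer than 100 lightweight feedback points.
human_ratings = (candidate_scores[:, 0]
                 + 0.5 * candidate_scores[:, 2]
                 + rng.normal(scale=0.3, size=80))

# Fit a regression so the weighted combination of metrics tracks the
# human ratings; the learned weights make the composite interpretable.
composer = Ridge(alpha=1.0).fit(candidate_scores, human_ratings)
composed_metric = composer.predict(candidate_scores)

# Rank agreement with humans is the quantity being optimized.
tau, _ = kendalltau(composed_metric, human_ratings)
print(f"Kendall's tau vs. human ratings: {tau:.3f}")
print("metric weights:", np.round(composer.coef_, 2))
```

In practice the weights would be fit on held-out human feedback and the composite reused as a cheap automatic evaluator (or, per the abstract, as a proxy reward).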

Michael J. Ryan, Yanzhe Zhang, Amol Salunkhe, Yi Chu, Di Xu, Diyi Yang • 2025

Related benchmarks

Task                     | Dataset                             | Metric        | Score | Rank
Human-Metric Correlation | SimpEval (In-Distribution)          | Kendall's Tau | 0.321 | 9
Human-Metric Correlation | HelpSteer2 (In-Distribution)        | Kendall's Tau | 0.342 | 9
Human-Metric Correlation | EvalGen (Out-of-Distribution)       | Kendall's Tau | 0.382 | 9
Human-Metric Correlation | RealHumanEval (Out-of-Distribution) | Kendall's Tau | 0.16  | 9
Human-Metric Correlation | CoGym (Out-of-Distribution)         | Kendall's Tau | 0.365 | 9
