
AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators

About

Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, and dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with the human signal, taking you from expensive human measurements to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We also show that AutoMetrics can serve as a proxy reward that matches the effect of a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
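
The regression step at the heart of this pipeline is straightforward to sketch. The snippet below is a minimal illustration under assumptions, not the released toolkit: it supposes a matrix of per-output scores from candidate metrics (retrieved MetricBank entries plus generated LLM-judge criteria) and a small vector of human ratings, fits a ridge regression to weight the candidates, and reports Kendall's tau between the composed metric and the human signal. All variable names, the toy data, and the choice of ridge regularization are illustrative assumptions.

```python
# Minimal sketch of regression-based metric composition (hypothetical names;
# not the authors' implementation).
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy data: 80 system outputs, each scored by 6 candidate metrics
# (stand-ins for MetricBank retrievals + generated LLM-judge criteria).
candidate_scores = rng.normal(size=(80, 6))

# Synthetic "human signal": correlated with a sparse subset of the metrics,
# mimicking fewer than 100 lightweight feedback points.
human_ratings = (candidate_scores[:, 0]
                 + 0.5 * candidate_scores[:, 2]
                 + rng.normal(scale=0.3, size=80))

# Fit a regression so the weighted combination of metrics tracks the
# human ratings; the learned weights make the composite interpretable.
composer = Ridge(alpha=1.0).fit(candidate_scores, human_ratings)
composed_metric = composer.predict(candidate_scores)

# Rank agreement with humans is the quantity being optimized.
tau, _ = kendalltau(composed_metric, human_ratings)
print(f"Kendall's tau vs. human ratings: {tau:.3f}")
print("metric weights:", np.round(composer.coef_, 2))
```

In practice the weights would be fit on held-out human feedback and the composite reused as a cheap automatic evaluator (or, per the abstract, as a proxy reward).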

Michael J. Ryan, Yanzhe Zhang, Amol Salunkhe, Yi Chu, Di Xu, Diyi Yang • 2025

Related benchmarks

Task                     | Dataset                             | Metric        | Score | Rank
Human-Metric Correlation | SimpEval (In-Distribution)          | Kendall's Tau | 0.321 | 9
Human-Metric Correlation | HelpSteer2 (In-Distribution)        | Kendall's Tau | 0.342 | 9
Human-Metric Correlation | EvalGen (Out-of-Distribution)       | Kendall's Tau | 0.382 | 9
Human-Metric Correlation | RealHumanEval (Out-of-Distribution) | Kendall's Tau | 0.16  | 9
Human-Metric Correlation | CoGym (Out-of-Distribution)         | Kendall's Tau | 0.365 | 9
