Human-Metric Correlation on EvalGen Out-of-Distribution
Chart: Kendall's Tau by date. Best result: AutoMetrics, 0.382 (Dec 19, 2025).
Evaluation Results
Method                 Backbone          Date      Kendall's Tau
AutoMetrics            Qwen-3-32B        2025.12    0.382
AutoMetrics            GPT-4o-mini       2025.12    0.334
LLM-Judge              Qwen-3-32B        2025.12    0.272
DnA Eval               Qwen-3-32B        2025.12    0.232
Best Existing Metric   Model Agnostic    2025.12    0.193
DnA Eval               GPT-4o-mini       2025.12    0.174
LLM-Judge              GPT-4o-mini       2025.12    0.161
Finetuned LLM          Model Agnostic    2025.12    0.054
MetaMetrics            Model Agnostic    2025.12   -0.214
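
For readers unfamiliar with the statistic, the Kendall's Tau values reported above measure how well a metric's ordering of outputs agrees with the human ordering. The following is a minimal sketch only: the page does not say which variant of Kendall's Tau the benchmark uses, so the tie-corrected tau-b form is an assumption, and the function name and sample scores are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(metric_scores, human_scores):
    """Kendall's tau-b between two equal-length lists of numeric scores.

    Assumption: tau-b (tie-corrected) is used; the benchmark page does not
    state which variant it reports.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")

    concordant = discordant = 0
    tied_metric_only = tied_human_only = 0

    # Compare every unordered pair of items once.
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        dm, dh = m1 - m2, h1 - h2
        if dm == 0 and dh == 0:
            continue  # tied in both lists: excluded from both denominator factors
        if dm == 0:
            tied_metric_only += 1
        elif dh == 0:
            tied_human_only += 1
        elif dm * dh > 0:
            concordant += 1  # both lists order the pair the same way
        else:
            discordant += 1  # the lists disagree on the pair's order

    untied = concordant + discordant
    denom = sqrt((untied + tied_metric_only) * (untied + tied_human_only))
    return (concordant - discordant) / denom if denom else float("nan")


if __name__ == "__main__":
    # Hypothetical scores for four outputs (not taken from the benchmark).
    metric = [0.10, 0.40, 0.35, 0.80]
    human = [1, 2, 3, 4]
    print(round(kendall_tau_b(metric, human), 3))
```

A value of 1 means the metric ranks every pair of outputs the same way humans do, 0 means no association, and negative values (as in the last row of the table) mean the metric tends to rank pairs in the opposite order.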