Human-Metric Correlation on RealHumanEval (Out-of-Distribution)
[Figure: Kendall's Tau (y-axis) by date (x-axis). Top entry: AutoMetrics, 0.16, Dec 19, 2025.]
Evaluation Results
Method                 Backbone         Date      Kendall's Tau
AutoMetrics            GPT-4o-mini      2025.12   0.16
DnA Eval               GPT-4o-mini      2025.12   0.152
AutoMetrics            Qwen-3-32B       2025.12   0.145
Best Existing Metric   Model Agnostic   2025.12   0.138
DnA Eval               Qwen-3-32B       2025.12   0.071
LLM-Judge              GPT-4o-mini      2025.12   0.069
Finetuned LLM          Model Agnostic   2025.12   0.049
MetaMetrics            Model Agnostic   2025.12   0.025
LLM-Judge              Qwen-3-32B       2025.12   0.025
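
For readers unfamiliar with the reported score, the following is a minimal sketch of how a Kendall's Tau value between human ratings and an automatic metric could be computed. It implements the tau-b variant (which corrects for ties) by direct pair counting. The page does not state which variant or tooling the benchmark uses, so the choice of tau-b, the function name `kendall_tau_b`, and the argument names are assumptions made for illustration only.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(human_scores, metric_scores):
    """Kendall's tau-b between two equal-length score lists.

    Illustrative only: the benchmark's exact variant is not stated on the page.
    """
    if len(human_scores) != len(metric_scores):
        raise ValueError("score lists must have the same length")

    concordant = discordant = 0
    tied_human_only = tied_metric_only = 0

    # Compare every unordered pair of items once (O(n^2) pair counting).
    for (h1, m1), (h2, m2) in combinations(zip(human_scores, metric_scores), 2):
        dh, dm = h1 - h2, m1 - m2
        if dh == 0 and dm == 0:
            continue  # tied in both lists: contributes to neither factor
        if dh == 0:
            tied_human_only += 1
        elif dm == 0:
            tied_metric_only += 1
        elif dh * dm > 0:
            concordant += 1  # both lists order the pair the same way
        else:
            discordant += 1  # the lists order the pair in opposite ways

    untied = concordant + discordant
    # Pairs untied in the human list, times pairs untied in the metric list.
    denom = sqrt((untied + tied_metric_only) * (untied + tied_human_only))
    if denom == 0:
        return float("nan")  # undefined when one list is constant
    return (concordant - discordant) / denom
```

As a usage example, `kendall_tau_b([1, 2, 3, 4, 5], [0.1, 0.3, 0.2, 0.5, 0.4])` has 8 concordant and 2 discordant pairs out of 10, giving 0.6. Values near 0, such as those in the table above, indicate that the metric's ranking agrees with human judgments only slightly more often than it disagrees.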