Human-Metric Correlation on CoGym (Out-of-Distribution)
[Chart: Kendall's Tau by method. Top entry: AutoMetrics, 0.365 (Dec 19, 2025).]
Evaluation Results
Method                 Backbone          Date      Kendall's Tau
AutoMetrics            Qwen-3-32B        2025.12    0.365
DnA Eval               Qwen-3-32B        2025.12    0.353
LLM-Judge              Qwen-3-32B        2025.12    0.276
Finetuned LLM          Model Agnostic    2025.12    0.223
LLM-Judge              GPT-4o-mini       2025.12    0.199
DnA Eval               GPT-4o-mini       2025.12    0.185
Best Existing Metric   Model Agnostic    2025.12    0.074
AutoMetrics            GPT-4o-mini       2025.12   -0.034
MetaMetrics            Model Agnostic    2025.12   -0.119
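
For readers unfamiliar with the score, the Kendall's Tau values reported above measure how often a method's ranking of outputs agrees with the human ranking, pair by pair. The sketch below computes the simplest variant (tau-a, where tied pairs count as neither agreement nor disagreement). Which variant this benchmark uses is not stated on the page, so treat the choice as an assumption; the function name and the toy data are hypothetical and are not taken from CoGym.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Assumption: tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two items")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both rankings order the pair the same way.
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical toy data: automatic metric scores vs. human ratings
# for four outputs. One of the six pairs is ordered differently.
metric_scores = [0.2, 0.5, 0.4, 0.9]
human_ratings = [1, 2, 3, 4]
tau = kendall_tau_a(metric_scores, human_ratings)
print(f"Kendall's tau-a: {tau:.3f}")
```

A value of 1 means the two rankings agree on every pair, -1 means they disagree on every pair, and values near 0 (such as several rows in the table) indicate little rank agreement with human judgments.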