Rating on Held-out games (test)
[Chart: leaderboard metric over time. Highlighted point: BG-Critic, 0.49 MAE, Jun 1, 2026. Selectable metrics: MAE, Absolute Bias, Kendall's Tau.]
Updated 1d ago
Evaluation Results
| Method | Notes | Date | MAE | Absolute Bias | Kendall's Tau |
|---|---|---|---|---|---|
| BG-Critic | Stage 2 multi-task tra... | 2026.06 | 0.49 | 0.16 | 0.368 |
| BG-Critic | Stage 2 multi-task tra... | 2026.06 | 0.5 | 0.18 | 0.36 |
| Qwen3.5-27B | Backbone Status=untuned | 2026.06 | 0.76 | 0.72 | 0.186 |
| Qwen3.5-397B | | 2026.06 | 1.05 | 1.04 | 0.232 |
| Gemini-3.1-Flash | | 2026.06 | 1.11 | 1.09 | 0.25 |
| GPT-5.4 | | 2026.06 | 1.3 | 1.29 | 0.336 |
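The page does not define the three reported metrics, so the following is only a minimal sketch under assumed definitions: MAE as the mean absolute difference between predicted and true ratings, Absolute Bias as the absolute value of the mean signed error, and Kendall's Tau as the tau-b rank correlation. The function names and the sample ratings are hypothetical and made up for illustration; they are not taken from the benchmark.

```python
from itertools import combinations
from math import sqrt


def mae(pred, true):
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def absolute_bias(pred, true):
    """Absolute value of the mean signed error (assumed definition)."""
    return abs(sum(p - t for p, t in zip(pred, true)) / len(true))


def kendall_tau_b(pred, true):
    """Kendall's tau-b, computed over all pairs in O(n^2)."""
    concordant = discordant = ties_pred = ties_true = 0
    for (p1, t1), (p2, t2) in combinations(zip(pred, true), 2):
        dp, dt = p1 - p2, t1 - t2
        if dp == 0:
            ties_pred += 1
        if dt == 0:
            ties_true += 1
        if dp * dt > 0:
            concordant += 1
        elif dp * dt < 0:
            discordant += 1
    n_pairs = len(true) * (len(true) - 1) // 2
    denom = sqrt((n_pairs - ties_pred) * (n_pairs - ties_true))
    return (concordant - discordant) / denom if denom else float("nan")


# Made-up ratings for four games (illustration only).
true_ratings = [6.0, 7.0, 8.0, 5.0]

# A model that ranks every game correctly but rates each 0.5 too high.
offset_pred = [6.5, 7.5, 8.5, 5.5]

# A model whose errors cancel out on average but which swaps one pair.
noisy_pred = [7.0, 6.0, 8.5, 4.5]

for name, pred in [("offset", offset_pred), ("noisy", noisy_pred)]:
    print(name, mae(pred, true_ratings),
          absolute_bias(pred, true_ratings),
          kendall_tau_b(pred, true_ratings))
```

Under these assumed definitions, a constant offset makes Absolute Bias equal to MAE while leaving the ranking intact. That is one possible reading of rows where the two values nearly coincide (for example 1.3 and 1.29), though the page itself does not confirm it.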