Diagnostic on Held-out games (test)
[Chart: Quality Score over time. Top result: 6.07 (BG-Critic), Jun 1, 2026.]
Evaluation Results
Method             Details                      Date      Quality Score
BG-Critic          Stage 2 multi-task tra...    2026.06   6.07
GPT-5.4            -                            2026.06   3.92
Gemini-3.1-Flash   -                            2026.06   2.64
Qwen3.5-397B       -                            2026.06   2.39
Qwen3.5-27B        Backbone Status=untuned      2026.06   1.99
BG-Critic          Stage 2 multi-task tra...    2026.06   1.76