Comparison on Held-out games (test)
[Chart: Accuracy. Top result: BG-Critic, 96.8 (Jun 1, 2026). Updated 1d ago.]
Evaluation Results
| Method | Details | Date | Accuracy |
|---|---|---|---|
| BG-Critic | Stage 2 multi-task tra... | 2026.06 | 96.8 |
| GPT-5.4 | | 2026.06 | 92.1 |
| Qwen3.5-27B | Backbone Status=untuned | 2026.06 | 90.2 |
| Gemini-3.1-Flash | | 2026.06 | 88.7 |
| Qwen3.5-397B | | 2026.06 | 85.5 |
| BG-Critic | Stage 2 multi-task tra... | 2026.06 | 73.1 |