Strategic Reasoning on HardCore Don'tSayIt OOD (held-out variant)
[Chart: Win Rate by date. Top entry: DEPT, 22.92, May 9, 2026]
Updated 22d ago
Evaluation Results
Method   Backbone       Date     Win Rate
DEPT     Qwen3-4B-Base  2026.05  22.92
GRPO     Qwen3-4B-Base  2026.05  22.01
DEPT     Qwen3-8B-Base  2026.05  19.27
SPIRAL   Qwen3-8B-Base  2026.05  18.88
SPIRAL   Qwen3-4B-Base  2026.05  17.97
SPAG     Qwen3-8B-Base  2026.05  12.12
MARS     Qwen3-4B-Base  2026.05  12.11
GRPO     Qwen3-8B-Base  2026.05  10.03
MARS     Qwen3-8B-Base  2026.05   8.33
SPAG     Qwen3-4B-Base  2026.05   7.03
VANILLA  Qwen3-8B-Base  2026.05   2.34
VANILLA  Qwen3-4B-Base  2026.05   0.39