Strategic Reasoning on VariableSum Dollar OOD (held-out variant)
[Chart: Win Rate by method. Top entry: DEPT, 30.47 (May 9, 2026).]
Updated 22d ago
Evaluation Results
| Method  | Backbone      | Date    | Win Rate |
|---------|---------------|---------|----------|
| DEPT    | Qwen3-4B-Base | 2026.05 | 30.47    |
| GRPO    | Qwen3-4B-Base | 2026.05 | 27.99    |
| DEPT    | Qwen3-8B-Base | 2026.05 | 27.73    |
| SPAG    | Qwen3-8B-Base | 2026.05 | 26.56    |
| SPAG    | Qwen3-4B-Base | 2026.05 | 26.17    |
| MARS    | Qwen3-4B-Base | 2026.05 | 25.39    |
| MARS    | Qwen3-8B-Base | 2026.05 | 24.48    |
| SPIRAL  | Qwen3-4B-Base | 2026.05 | 23.96    |
| SPIRAL  | Qwen3-8B-Base | 2026.05 | 23.18    |
| GRPO    | Qwen3-8B-Base | 2026.05 | 19.14    |
| VANILLA | Qwen3-4B-Base | 2026.05 | 3.78     |
| VANILLA | Qwen3-8B-Base | 2026.05 | 2.73     |