Strategic Reasoning on RandomValue Negotiation OOD (held-out variant)
Top result: DEPT, Win Rate 17.08 (May 9, 2026)
Updated 22d ago
Evaluation Results
Method    Backbone        Date      Win Rate
DEPT      Qwen3-8B-Base   2026.05   17.08
DEPT      Qwen3-4B-Base   2026.05   15.36
GRPO      Qwen3-4B-Base   2026.05   14.52
SPAG      Qwen3-8B-Base   2026.05   13.02
SPIRAL    Qwen3-4B-Base   2026.05   12.56
MARS      Qwen3-4B-Base   2026.05   12.55
SPAG      Qwen3-4B-Base   2026.05   11.85
MARS      Qwen3-8B-Base   2026.05    7.56
GRPO      Qwen3-8B-Base   2026.05    6.64
SPIRAL    Qwen3-8B-Base   2026.05    6.52
VANILLA   Qwen3-8B-Base   2026.05    3.78
VANILLA   Qwen3-4B-Base   2026.05    0.13