Mathematical Reasoning on Countdown 4,5,6-arg held-out difficulties (test)
[Chart: Accuracy over time. Best result: SkillFactory -> GRPO, 25.1 accuracy, Dec 3, 2025.]
Evaluation Results
Method                  Config                       Date      Accuracy
SkillFactory -> GRPO    base_model=Qwen2.5-1.5...    2025.12   25.1
R1 Distill -> GRPO      base_model=Qwen2.5-1.5...    2025.12   21.2
RL-Only                 base_model=Qwen2.5-1.5...    2025.12   15.8
BOLT -> GRPO            base_model=Qwen2.5-1.5...    2025.12   13.7
R1 Distill              base_model=Qwen2.5-1.5...    2025.12   11.7
STaR -> GRPO            base_model=Qwen2.5-1.5...    2025.12   9.7
SkillFactory            base_model=Qwen2.5-1.5...    2025.12   2.8
STaR                    base_model=Qwen2.5-1.5...    2025.12   2.6
Qwen2.5 1.5B Instruct   base_model=Qwen2.5-1.5...    2025.12   1.9
BOLT                    base_model=Qwen2.5-1.5...    2025.12   0.5