Multi-task Generalization on Countdown and OOD Tasks Overall (test)
[Chart: Accuracy over time. Top result: 35.9 (R1 Distill -> GRPO). Date shown: Dec 3, 2025.]
Evaluation Results

Method                  Training     Date      Accuracy
R1 Distill -> GRPO      SFT + GRPO   2025.12   35.9
SkillFactory -> GRPO    SFT + GRPO   2025.12   35.7
BOLT -> GRPO            SFT + GRPO   2025.12   33
RL-Only                 RL           2025.12   31.9
R1 Distill              SFT          2025.12   30.3
STaR -> GRPO            SFT + GRPO   2025.12   30.2
Qwen2.5 1.5B Instruct   None         2025.12   27.3
SkillFactory            SFT          2025.12   25.5
STaR                    SFT          2025.12   20.4
BOLT                    SFT          2025.12   16.2