
Multi-task Generalization on Countdown and OOD Tasks Overall (test)

Accuracy: 35.9

R1 Distill -> GRPO

Dec 3, 2025
Updated 1mo ago

Evaluation Results

Date      Accuracy
2025.12   35.9
2025.12   35.7
2025.12   33
2025.12   31.9
2025.12   30.3
2025.12   30.2
2025.12   27.3
2025.12   25.5
2025.12   20.4
2025.12   16.2