| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Mathematical Reasoning | Countdown | Accuracy 85 | 168 |
| Planning | Countdown | Accuracy 82 | 68 |
| Mathematical Reasoning | Countdown (test) | Accuracy 51.2 | 36 |
| Reasoning | Countdown | Accuracy 83.2 | 32 |
| Symbolic Reasoning | Countdown | Accuracy 49.61 | 24 |
| Logical Reasoning | Countdown | Accuracy 52 | 16 |
| Arithmetic Reasoning | Countdown 512 tokens | Pass@1 62.1 | 15 |
| Arithmetic Reasoning | Countdown 256 tokens | Pass@1 71.1 | 15 |
| Planning | Countdown (held-out) | Pass@1 87.96 | 14 |
| Logical Reasoning | Countdown CD34 | Avg@16 78.2 | 14 |
| Logical Reasoning | Countdown CD4 | Avg@16 59.4 | 14 |
| Numerical Reasoning | Countdown-4 | CD4 98.9 | 13 |
| Reasoning | COUNTDOWN (test) | Accuracy 66.02 | 13 |
| Mathematical Reasoning | Countdown 4,5,6-arg held-out difficulties (test) | Accuracy 25.1 | 10 |
| Mathematical Reasoning | Countdown 8B Instruct (test) | Accuracy 46.1 | 9 |
| Mathematical Reasoning | Countdown | Accuracy (L=128) 39.84 | 9 |
| Mathematical Reasoning | Countdown-34 (held-out) | Accuracy 81.26 | 8 |
| Uncertainty Quantification | Countdown | ROC-AUC (128) 0.61 | 8 |
| Arithmetic Reasoning | Countdown 0-shot (test) | Pass@1 (Greedy) 71.5 | 7 |
| Arithmetic Reasoning | Countdown | Pass@1 92 | 6 |
| Symbolic planning | Countdown | Exact-match Accuracy (Ngen=128) 40.6 | 6 |
| Reasoning | Countdown | Average Diffusion Steps 40.4 | 6 |
| Logical Reasoning | Countdown (test) | Accuracy Pass@1 74.7 | 5 |
| Mathematical Reasoning | Countdown (CTD) (test) | Accuracy 43.8 | 4 |
| Planning | Countdown | Score (%) 15.3 | 4 |