| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Planning | Countdown | Accuracy 82 | 68 | |
| Mathematical Reasoning | Countdown (test) | Accuracy 51.2 | 36 | |
| Mathematical Reasoning | Countdown | Accuracy 60.16 | 36 | |
| Reasoning | Countdown | Accuracy 83.2 | 24 | |
| Symbolic Reasoning | Countdown | Accuracy 49.61 | 24 | |
| Arithmetic Reasoning | Countdown 512 tokens | Pass@1 62.1 | 15 | |
| Arithmetic Reasoning | Countdown 256 tokens | Pass@1 71.1 | 15 | |
| Logical Reasoning | Countdown CD34 | Avg@16 78.2 | 14 | |
| Logical Reasoning | Countdown CD4 | Avg@16 59.4 | 14 | |
| Reasoning | COUNTDOWN (test) | Accuracy 66.02 | 13 | |
| Mathematical Reasoning | Countdown 4,5,6-arg held-out difficulties (test) | Accuracy 25.1 | 10 | |
| Mathematical Reasoning | Countdown | Accuracy (L=128) 39.84 | 9 | |
| Symbolic planning | Countdown | Exact-match Accuracy (Ngen=128) 40.6 | 6 | |
| Reasoning | Countdown | Average Diffusion Steps 40.4 | 6 | |
| Mathematical Reasoning | Countdown (val) | Accuracy 41.9 | 3 | |
| Countdown | Countdown (test) | Accuracy - | 0 | |