| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Math reasoning | GSM-Hard (test) | Accuracy | 55.94 | 30 |
| Mathematical Reasoning | GSM-Hard OOD | Greedy Accuracy | 19 | 23 |
| Mathematical Reasoning | GSM-Hard OOD 1.0 (test) | Greedy Success Rate | 12 | 9 |
| Mathematical Reasoning | GSM-Hard Out-of-Distribution (test) | Final Answer Accuracy | 71 | 5 |