| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reasoning | Reasoning Evaluation Suite Math, Symbolic, and Commonsense (test) | Math Accuracy: 80.8 | 33 |
| Reasoning | Reasoning Evaluation Suite AIME 2024, GSM8k, MATH 500, GPQA | AIME 2024 Score: 60 | 32 |
| Reasoning | Reasoning Evaluation Suite (MATH, GSM8K, AQUA, GSM-H, MMLU, MMLU-P, AIME) (test) | MATH Score: 52.4 | 8 |