| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reasoning | Reasoning Benchmark Suite Aggregate | Average Score: 59.44 | 36 |
| General Reasoning | Reasoning Benchmark Suite (GSM8K, MATH500, GPQA, CSQA, AQuA, MMLU) | Average Accuracy: 86.61 | 7 |
| Mathematical and Science Reasoning | Reasoning Benchmark Suite (MATH500, GSM8K, AMC23, Minerva, MMLU, MMLU-Pro, GPQA) | MATH500 Score: 81.15 | 2 |