| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Mathematical Reasoning | Math Benchmarks Aggregate | Accuracy (Avg) | 81.9 | 62 |
| Mathematical Reasoning | Math Benchmarks GSM8K, Minerva, MATH, MathQA | GSM8K Score | 59.24 | 53 |
| Mathematical Reasoning | Math Benchmarks Average | Accuracy (ACC) | 76.1 | 47 |
| Mathematical Reasoning | Math Benchmarks Aggregate | Pass@1 | 71.8 | 44 |
| Mathematical Reasoning | Math Benchmarks AIME 2024, AIME 2025, OlympiadBench | AIME 2024 Score | 19.5 | 19 |
| Math Reasoning | Mean of six math benchmarks | Pass@1 | 43.8 | 12 |
| Mathematical Reasoning | Math Benchmarks Overall (test) | Pass@1 | 87 | 12 |
| Math Problem Solving | Math Benchmarks LIMO curation (test) | Accuracy | 72.6 | 10 |
| Math Reasoning | Math Benchmarks MATH, GSM8K, AMC23, AIME24, Minerva, Gaokao, Olympiad (test) | MATH Score | 75.1 | 10 |
| Mathematical Reasoning | Math Benchmarks (GSM8K, MATH, AMC23, AIME24) (test) | Accuracy (GSM8K) | 96 | 8 |
| Mathematical Reasoning | Math Benchmarks Math500, OlympiadBench, Minerva, AIME, AMC | Math500 Accuracy | 85.6 | 7 |
| Mathematical Reasoning | Math Benchmarks evaluated on Llama 3-70B | GSM8K Accuracy | 78.2 | 5 |
| Mathematical Reasoning | Math Benchmarks MATH, MATH500, ThmQA | MATH multi@5 Accuracy | 67.6 | 4 |
| Mathematical Reasoning | Math Benchmarks (test) | GSM8K Accuracy | 28.9 | 3 |