| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Mathematical Reasoning | Mathematical Reasoning Aggregate | Average Score: 46.63 | 18 |
| Offline OOD Detection | Mathematical Reasoning Far-shift OOD | AUROC: 96.54 | 14 |
| Offline OOD Detection | Mathematical Reasoning Near-shift OOD | AUROC: 98.76 | 14 |
| Mathematical Reasoning | Mathematical Reasoning benchmarks (AIME2024, AMC, MATH-500, Minerva, Olympiad) | AIME 2024 Score: 33.5 | 12 |
| Mathematical Reasoning | Mathematical Reasoning (MATH500, AIME25, OlympiadBench, AMC23) 2023/2025 (test) | MATH Score: 86.2 | 12 |
| Mathematical Reasoning | Mathematical Reasoning In-Distribution various (test) | AIME 24 Score: 33.4 | 12 |
| OOD Quality Estimation | Mathematical Reasoning OOD (Near-shift) | Kendall Tau: 0.159 | 12 |
| OOD Quality Estimation | Mathematical Reasoning Far-shift OOD | Kendall's Tau: 0.161 | 12 |
| Mathematical Reasoning | Mathematical Reasoning Average (AMC23, Minerva, Olympiad, Math500, AIME24, AIME25) | Acc@16: 38.9 | 6 |
| Mathematical Reasoning | Mathematical Reasoning (OOD) | Algebra222 Accuracy: 89.9 | 4 |
| Mathematical Reasoning | Mathematical Reasoning (in-distribution) | GSM8K Score: 81.7 | 4 |
| Mathematical Reasoning | Mathematical Reasoning | CA: 47.19 | 2 |