| Dataset Name | SOTA Method | Metric | Value | Trend | Last Updated |
|---|---|---|---|---|---|
| AMC23 | TaH+ | Accuracy | 70.6 | 44 | 14d ago |
| 200 IMO-level math problems IMO-AnswerBench, IMO-ProofBench, ArXivMath (test) | Meta-Harness | Pass@1 Accuracy | 50.6 | 36 | 2mo ago |
| GSM8K | Bifrost | Solve Rate | 90.22 | 27 | 3mo ago |
| MWPBENCH (out-of-domain) | WizardMath-Mistral-RL | College Math Acc | 24.8 | 26 | 3mo ago |
| AIME 24 | BERT-Judge | Accuracy | 90 | 24 | 1mo ago |
| Math Macro-aggregate | Uno-Orchestra | Pass@1 | 79.2 | 22 | 27d ago |
| MMATH (test) | Qwen2.5-7B-Instruct + Vanilla GRPO | Accuracy (Ar) | 27.1 | 20 | 12d ago |
| Math Domain (AIME24, Math-OAI, Minerva, Olympiad, ACM23) Qwen2.5-7B (10% selection) | InstructDiff | AIME24 Score | 7.71 | 18 | 3mo ago |
| GSM8K | TaH+ | Accuracy | 91.5 | 17 | 23d ago |
| OlympiadBench | TaH+ | Accuracy | 52 | 17 | 1mo ago |
| MATH | Primitives-based MAS | Accuracy | 76.4 | 14 | 3mo ago |
| Math Benchmarks LIMO curation (test) | LALP | Accuracy | 72.6 | 10 | 1mo ago |
| AIME 2025 (test) | SelfBudgeter | Accuracy | 30 | 9 | 1mo ago |
| GSM8K (test) | Eurus-2-7B-PRIME | Accuracy | 90.98 | 9 | 1mo ago |
| LiveMathBench | FoT (Round 2) | AIME 24 Score | 100 | 4 | 1mo ago |
| GSM8k, SAT-Math, & MATH OpenCompass AGIEval sampled (test) | CRITIQ | GSM8k Accuracy | 32.22 | 4 | 3mo ago |