| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| GSM8K | BERT-Judge | Accuracy | 0.988 | 206 | 4d ago |
| MATH-500 | GLM-4.7-Flash-T | Accuracy | 99.2 | 86 | 12d ago |
| GSM8K | | Pass@1 | 96.51 | 47 | 3d ago |
| GSM8K 5-shot | UltraMix-190k | Score | 82.7 | 46 | 1mo ago |
| AIME24 | DSMoE | Accuracy | 77.3 | 38 | 10d ago |
| Math (test) | OPCD | Accuracy | 80.9 | 36 | 1mo ago |
| AIME 2025 | ePF | Top-1 Score | 28.83 | 26 | 18d ago |
| GSM-Plus | LLaDA2.0-flash | Score | 89.74 | 22 | 1mo ago |
| MATH | UM-190k | Score | 11.59 | 18 | 1mo ago |
| GSM8K (test) | ReSyn | Mean@4 | 91.4 | 18 | 1mo ago |
| Math (val) | GRPO | Pass@16 | 100 | 16 | 1mo ago |
| GSM8K, MATH-500 (test) | REAP | GSM8K Accuracy | 92.6 | 15 | 29d ago |
| OMEGA | Qwen 3 VL 32B Instruct | Accuracy | 44 | 13 | 1mo ago |
| MATH | CapFlow | Solve Rate | 59.87 | 11 | 1mo ago |
| GSM8K | CapFlow | Solve Rate | 94.97 | 11 | 1mo ago |
| Omni-MATH | LLaDA2.1-flash | Score | 54.1 | 10 | 1mo ago |
| CMATH | LLaDA2.0-flash | Score | 96.9 | 10 | 1mo ago |
| OlympiadBench | | Score | 77.59 | 10 | 1mo ago |
| AIME 2025 | LLaDA2.1-flash | Score | 63.33 | 10 | 1mo ago |
| GSM8K 1,000-example (test) | Qwen3-VL-2B-Instruct | PPL | 5.8317 | 10 | 1mo ago |
| HRM8K | | EM Score | 89.08 | 9 | 29d ago |
| MATH 4-shot | UM-190k | Score | 15.59 | 9 | 1mo ago |
| MATH 4-shot | UM-190k | Accuracy | 5.27 | 9 | 1mo ago |
| IMO-ANSWERBENCH | | Score | 53.8 | 9 | 1mo ago |
| Math | Global Surgery | Math Score | 60.8 | 8 | 1mo ago |