| Dataset Name | SOTA Method | Metric | | Updated | Trend |
|---|---|---|---|---|---|
| MultiArith | | Accuracy 100 | 229 | 26d ago | |
| GSM8K | DUP | Accuracy 97.1 | 173 | 4d ago | |
| GSM8K (test) | SGE | Accuracy 97.35 | 129 | 1mo ago | |
| AddSub | CLoT | Accuracy 99 | 123 | 8d ago | |
| Arithmetics | | Accuracy 100 | 106 | 3d ago | |
| MultiArith (test) | PaLM | Accuracy 99.3 | 67 | 1mo ago | |
| SVAMP | Graph-GRPO | Accuracy 96.01 | 61 | 1mo ago | |
| ASDiv | Automatic Model Selection with LLMs | Accuracy 93.5 | 58 | 18d ago | |
| AQuA (test) | SGE | Accuracy 74.63 | 58 | 1mo ago | |
| SVAMP (test) | SGE | Accuracy 98.16 | 54 | 1mo ago | |
| SVAMP | Automatic Model Selection with LLMs | Accuracy (Overall) 93.7 | 54 | 1mo ago | |
| In-domain (test) | | Accuracy 53.4 | 50 | 1mo ago | |
| SingleEq | PAL | Accuracy 98.8 | 47 | 1mo ago | |
| AQuA, GSM8K, MAWPS, SVAMP | Qwen2.5-14B | AQuA Accuracy 62.2 | 31 | 1mo ago | |
| AQUA | DUP | Accuracy 77.1 | 31 | 1mo ago | |
| GSM8K sampled subset n=200 (test) | Chain-of-Thought Prompting | Accuracy 95.5 | 30 | 1mo ago | |
| SVAMP, GSM8K, AddSub, MultiArith, AQUA, SingleEq | DUP | Average Score 92.9 | 28 | 8d ago | |
| MATH | LLM-PeerReview-W | Accuracy 71 | 23 | 1mo ago | |
| MAWPS | PaLM 540B | Accuracy 93.5 | 20 | 1mo ago | |
| Countdown 512 tokens | d-TreeRPO | Pass@1 62.1 | 15 | 1mo ago | |
| Countdown 256 tokens | d-TreeRPO | Pass@1 71.1 | 15 | 1mo ago | |
| GSM8K | MFS (Ours) | Pass@1 87.64 | 14 | 1mo ago | |
| SVAMP latest (test) | | Accuracy 64.8 | 14 | 1mo ago | |
| SVAMP | CPO | Accuracy 69.3 | 12 | 1mo ago | |
| Game of 24 | ReSCALE | Performance 85.3 | 11 | 25d ago | |