| MultiArith | | Accuracy 100 | | 293 | 7d ago |
| GSM8K | DUP | Accuracy 97.1 | | 272 | 7d ago |
| GSM8K (test) | SGE | Accuracy 97.35 | | 189 | 14d ago |
| AddSub | CLoT | Accuracy 99 | | 149 | 7d ago |
| MultiArith (test) | PaLM | Accuracy 99.3 | | 115 | 12d ago |
| Arithmetics | | Accuracy 100 | | 106 | 1mo ago |
| SVAMP | Graph-GRPO | Accuracy 96.01 | | 87 | 7d ago |
| SingleEq | PAL | Accuracy 98.8 | | 73 | 7d ago |
| SVAMP (test) | SGE | Accuracy 98.16 | | 70 | 12d ago |
| ASDiv | Automatic Model Selection with LLMs | Accuracy 93.5 | | 62 | 1mo ago |
| AQuA (test) | SGE | Accuracy 74.63 | | 58 | 3mo ago |
| AQUA | DUP | Accuracy 77.1 | | 57 | 7d ago |
| SVAMP | Automatic Model Selection with LLMs | Accuracy (Overall) 93.7 | | 54 | 3mo ago |
| In-domain (test) | | Accuracy 53.4 | | 50 | 3mo ago |
| SVAMP, GSM8K, AddSub, MultiArith, AQUA, SingleEq | DUP | Average Score 92.9 | | 46 | 1mo ago |
| MATH | LLM-PeerReview-W | Accuracy 71 | | 39 | 23d ago |
| AQuA, GSM8K, MAWPS, SVAMP | Qwen2.5-14B | AQuA Accuracy 62.2 | | 31 | 3mo ago |
| GSM8K sampled subset n=200 (test) | Chain-of-Thought Prompting | Accuracy 95.5 | | 30 | 3mo ago |
| Game of 24 (test) | ICRL Preset | Success Rate 90 | | 28 | 8d ago |
| Average Arithmetic Reasoning Tasks | | Accuracy 84.3 | | 26 | 7d ago |
| Arithmetic Reasoning Benchmarks (MultiArith, GSM8K, AddSub, AQuA, SingleEQ, SVAMP, MAWPS) MATH-10K fine-tuned (test) | S2FT | MultiArith Accuracy 99.67 | | 24 | 1mo ago |
| MAWPS | PaLM 540B | Accuracy 93.5 | | 20 | 3mo ago |
| Countdown | LIFT2 | Accuracy 33.6 | | 19 | 9d ago |
| Matchstick Arithmetic (held-out) | CORE | Accuracy 90.7 | | 18 | 6d ago |
| Countdown 512 tokens | d-TreeRPO | Pass@1 62.1 | | 15 | 3mo ago |