| MultiArith | | Accuracy 100 | | 181 | 2d ago |
| GSM8K | DUP | Accuracy 97.1 | | 155 | 2d ago |
| GSM8K (test) | SGE | Accuracy 97.35 | | 129 | 2d ago |
| AddSub | Teaching-Inspired Integrated Prompting Framework | Accuracy 98.2 | | 76 | 2d ago |
| MultiArith (test) | PaLM | Accuracy 99.3 | | 67 | 3d ago |
| AQuA (test) | SGE | Accuracy 74.63 | | 58 | 2d ago |
| SVAMP (test) | SGE | Accuracy 98.16 | | 54 | 2d ago |
| ASDiv | Automatic Model Selection with LLMs | Accuracy 93.5 | | 54 | 2d ago |
| SVAMP | Automatic Model Selection with LLMs | Accuracy (Overall) 93.7 | | 54 | 3d ago |
| In-domain (test) | | Accuracy 53.4 | | 50 | 3d ago |
| SVAMP | DUP | Accuracy 94.2 | | 48 | 2d ago |
| SingleEq | PAL | Accuracy 98.8 | | 43 | 2d ago |
| AQUA | DUP | Accuracy 77.1 | | 31 | 2d ago |
| GSM8K sampled subset n=200 (test) | Chain-of-Thought Prompting | Accuracy 95.5 | | 30 | 3d ago |
| MAWPS | PaLM 540B | Accuracy 93.5 | | 20 | 3d ago |
| MATH | LLM-PeerReview-W | Accuracy 71 | | 16 | 3d ago |
| Countdown 512 tokens | d-TreeRPO | Pass@1 62.1 | | 15 | 3d ago |
| Countdown 256 tokens | d-TreeRPO | Pass@1 71.1 | | 15 | 3d ago |
| GSM8K | MFS (Ours) | Pass@1 87.64 | | 14 | 3d ago |
| SVAMP latest (test) | | Accuracy 64.8 | | 14 | 3d ago |
| SVAMP | CPO | Accuracy 69.3 | | 12 | 3d ago |
| Long Multiplication 2,3,4,5-digit (OOD) | R1 Distill -> GRPO | Accuracy 37.1 | | 10 | 3d ago |
| GSM8K 4-shot CoT | ADAFUSE (Top-2 Base) | Accuracy 90.25 | | 10 | 3d ago |
| SVAMP, GSM8K, AddSub, MultiArith, AQUA, SingleEq | DUP | Average Score 92.9 | | 10 | 3d ago |
| GSM8K | FP16 baseline | ACC 30.25 | | 10 | 3d ago |