| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Mathematical reasoning | MathQA | Accuracy | 98.84 | 305 |
| Question Answering | MathQA (test) | Accuracy | 81.05 | 41 |
| Math Word Problem solving | MathQA (test) | Accuracy | 81.5 | 34 |
| Mathematical Reasoning | MathQA (test) | Accuracy | 87.6 | 33 |
| Mathematical Reasoning | MathQA | Retention | 25.19 | 28 |
| Zero-shot Reasoning | MathQA | Accuracy | 28.4 | 26 |
| Reasoning | MathQA | CACC | 75.9 | 25 |
| Correctness Prediction | MathQA | Accuracy | 66.15 | 18 |
| Reasoning | MathQA leave-one-out setup | Average Accuracy | 56.9 | 12 |
| Mathematical Reasoning | MathQA | Average Acceptance Length τ | 2,555 | 12 |
| Question Answering | MathQA | Accuracy | 78.7 | 12 |
| mathematical computation | MathQA | Exact Match (EM) | 52.34 | 10 |
| Math Programming | MathQA Python | Pass@80 | 87.4 | 8 |
| Downstream Task | MathQA | Accuracy | 24.32 | 7 |
| Numerical Question Answering | MathQA (test) | Program Accuracy | 83 | 6 |
| Common Sense Reasoning | MathQA | Accuracy | 64 | 4 |
| Code Generation | MathQA Python Original (test) | Pass@80 | 84.7 | 4 |
| CoT Soundness Evaluation | MathQA | CSR | 92 | 3 |
| CoT Naturalness | MathQA | PPL | 22.1 | 3 |
| Code Generation | MathQA | Normalized Performance | 100.79 | 3 |
| Human Evaluation | MathQA | Accuracy | 89.2 | 3 |
| Code Generation | MathQA Python Filtered (dev) | PASS@1 | 20.7 | 3 |
| Multiple Choice Question Answering | MathQA | Accuracy | 22.21 | 2 |