| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Mathematical Reasoning | AQUA | Accuracy | 85.05 | 146 |
| Math Reasoning | AQuA | Accuracy | 91.8 | 78 |
| Algebraic Reasoning | AQUA | Accuracy | 91.89 | 61 |
| Mathematical Reasoning | AQuA | AQuA Exact Match | 79.92 | 60 |
| Arithmetic Reasoning | AQuA (test) | Accuracy | 74.63 | 58 |
| Hallucination Detection | AQuA | AUROC | 0.7822 | 31 |
| Multiple-choice Question Answering | AQuA | Accuracy | 87.4 | 31 |
| Arithmetic Reasoning | AQUA | Accuracy | 77.1 | 31 |
| Marine Species Classification | AQUA20 (test) | Macro F1 | 88.9 | 28 |
| Symbolic Reasoning | AQUA | Accuracy | 80.3 | 26 |
| Reasoning | AQuA | CACC (%) | 72 | 25 |
| Mathematical Reasoning | AQuA | Accuracy (Without Verifier) | 74 | 16 |
| Mathematical Reasoning | AQuA | FRS | 96.8 | 9 |
| Mathematical Reasoning | AQUA (val) | Tokens at Best Step (K) | 336 | 7 |
| Mathematical Reasoning | AQUA (test) | Accuracy | 72.44 | 6 |
| CoT Soundness Evaluation | AQuA | CSR | 90 | 3 |
| CoT Naturalness | AQuA | PPL | 21.34 | 3 |
| Arithmetic Reasoning | AQUA | Accuracy (format-specific prompt) | 33.5 | 2 |
| Algebraic Reasoning | AQUA (test) | Accuracy | - | 0 |