| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Math Reasoning | AQuA | Accuracy | 93.55 | 188 |
| Mathematical Reasoning | AQUA | Accuracy | 85.05 | 167 |
| Algebraic Reasoning | AQUA | Accuracy | 91.89 | 65 |
| Mathematical Reasoning | AQuA | AQuA Exact Match | 79.92 | 60 |
| Arithmetic Reasoning | AQuA (test) | Accuracy | 74.63 | 58 |
| Arithmetic Reasoning | AQUA | Accuracy | 77.1 | 57 |
| Mathematical Reasoning | AQuA | Accuracy | 87.01 | 45 |
| Multiple-choice Question Answering | AQuA | Accuracy | 89.45 | 43 |
| Hallucination Detection | AQuA | AUROC | 0.7822 | 31 |
| Marine species classification | AQUA20 (test) | Macro F1 | 88.9 | 28 |
| Symbolic Reasoning | AQUA | Accuracy | 80.3 | 26 |
| Reasoning | AQuA | CACC (%) | 72 | 25 |
| Hybrid Reasoning | AQUA (test) | Accuracy | 78.5 | 24 |
| Mathematical Reasoning | AQUA (test) | Accuracy | 72.44 | 18 |
| Mathematical Reasoning | AQuA | Accuracy (Without Verifier) | 74 | 16 |
| Algebraic Reasoning | AQuA | Performance (%) | 66.36 | 12 |
| CoT faithfulness detection | AQuA | Accuracy (CoT Faithfulness) | 77 | 12 |
| Complex Reasoning | AQuA | Accuracy | 28.35 | 12 |
| Mathematical Reasoning | AQuA | FRS | 96.8 | 9 |
| Mathematical Reasoning | AQUA (val) | Tokens at Best Step (K) | 336 | 7 |
| Algebraic Reasoning | AQUA (test) | Accuracy | 30.94 | 6 |
| Mathematical Reasoning | AQUA | Answer Selection Rate (ASR) | 94.4 | 4 |
| Mathematical Reasoning | AQuA | Mean Accuracy | 93.42 | 3 |
| Algebraic Reasoning | AQUA | PPL | 22.7 | 3 |
| CoT Soundness Evaluation | AQuA | CSR | 90 | 3 |