| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Mathematical Reasoning | SVAMP | Accuracy | 97 | 368 |
| Mathematical Reasoning | SVAMP (test) | Accuracy | 94 | 233 |
| Arithmetic Reasoning | SVAMP (test) | Accuracy | 98.16 | 54 |
| Arithmetic Reasoning | SVAMP | Accuracy (Overall) | 93.7 | 54 |
| Mathematical Reasoning | SVAMP out-of-domain (test) | Accuracy | 97 | 50 |
| Hallucination Detection | SVAMP | Mean AUROC | 78.37 | 48 |
| Arithmetic Reasoning | SVAMP | Accuracy | 94.2 | 48 |
| Math Word Problem solving | SVAMP | Value Accuracy | 94.5 | 38 |
| Mathematical Reasoning | SVAMP (val) | Accuracy | 85.1 | 36 |
| Mathematical Reasoning | SVAMP | Pass@1 | 93.1 | 35 |
| Mathematical Reasoning | SVAMP | Accuracy | 94.31 | 21 |
| Mathematical Reasoning | SVAMP | AUROC | 0.6211 | 20 |
| Math Word Problem solving | SVAMP English (test) | Accuracy | 67.8 | 20 |
| Mathematical Reasoning | SVAMP | Pass@5 | 97 | 16 |
| Mathematical Reasoning | SVAMP | Accuracy | 83.33 | 14 |
| Math Reasoning | SVAMP (held-out) | Performance | 78.3 | 14 |
| Arithmetic Reasoning | SVAMP latest (test) | Accuracy | 64.8 | 14 |
| Mathematical Reasoning | SVAMP | Verifiability Score | 97.33 | 12 |
| Mathematical Reasoning | SVAMP | Reusability Score | 71.11 | 12 |
| Arithmetic Reasoning | SVAMP | Accuracy | 69.3 | 12 |
| Math Reasoning | SVAMP | Accuracy | 78.67 | 10 |
| Mathematical Reasoning | SVAMP | Accuracy (Context Size 128) | 0.8933 | 9 |
| Mathematical Reasoning | SVAMP | Accuracy | 59.8 | 9 |
| Mathematical Reasoning | SVAMP | Answer Correctness Rate | 53.8 | 8 |
| Uncertainty Estimation | SVAMP | AUROC | 93.6 | 7 |