| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Mathematical Reasoning | SVAMP | Accuracy | 97 | 403 |
| Mathematical Reasoning | SVAMP (test) | Accuracy | 94 | 262 |
| Arithmetic Reasoning | SVAMP | Accuracy | 96.01 | 61 |
| Arithmetic Reasoning | SVAMP (test) | Accuracy | 98.16 | 54 |
| Arithmetic Reasoning | SVAMP | Accuracy (Overall) | 93.7 | 54 |
| Mathematical Reasoning | SVAMP out-of-domain (test) | Accuracy | 97 | 50 |
| Hallucination Detection | SVAMP | Mean AUROC | 78.37 | 48 |
| Math Reasoning | SVAMP | Accuracy | 94.2 | 40 |
| Math Word Problem solving | SVAMP | Value Accuracy | 94.5 | 38 |
| Mathematical Reasoning | SVAMP (val) | Accuracy | 85.1 | 36 |
| Mathematical Reasoning | SVAMP | Pass@1 | 93.1 | 35 |
| Speculative Sampling | SVAMP | Average Acceptance Length | 5.38 | 28 |
| Group Collusive Attack Detection | SVAMP | Detection Accuracy | 92 | 27 |
| Mathematical Reasoning | SVAMP 8-shot (test) | Accuracy | 92 | 25 |
| Mathematical Reasoning | SVAMP | Accuracy | 94.31 | 21 |
| Mathematical Reasoning | SVAMP | AUROC | 0.6211 | 20 |
| Math Word Problem solving | SVAMP English (test) | Accuracy | 67.8 | 20 |
| Mathematical Reasoning | SVAMP | Pass@5 | 97 | 16 |
| Inference Attack | SVAMP | AUC | 97.61 | 15 |
| Mathematical Reasoning | SVAMP | Accuracy | 94.1 | 15 |
| Mathematical Word Problem Solving | SVAMP | Accuracy | 96.6 | 14 |
| Mathematical Reasoning | SVAMP | Accuracy | 83.33 | 14 |
| Math Reasoning | SVAMP (held-out) | Performance | 78.3 | 14 |
| Arithmetic Reasoning | SVAMP latest (test) | Accuracy | 64.8 | 14 |
| Mathematical Reasoning | SVAMP | Accuracy | 79.53 | 12 |