| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Arithmetic Reasoning | MultiArith | Accuracy | 100 | 229 |
| Mathematical Reasoning | MultiArith | Accuracy | 100 | 143 |
| Arithmetic Reasoning | MultiArith (test) | Accuracy | 99.3 | 67 |
| Math Reasoning | MultiArith | Accuracy | 98.3 | 65 |
| Mathematical Reasoning | MultiArith | Original Accuracy | 99 | 40 |
| Math reasoning | MultiArith (test) | Accuracy | 99.59 | 30 |
| Mathematical reasoning | MultiArith Out of Distribution | Top-1 Accuracy (Maj@1) | 100 | 30 |
| Group Collusive Attack Detection | MultiArith | Detection Accuracy | 92 | 27 |
| Question Answering | MultiArith | Accuracy | 74.3 | 24 |
| Mathematical Reasoning | MultiArith | Accuracy | 100 | 16 |
| Math Reasoning | MultiArith | Accuracy | 98.3 | 14 |
| Follow-up Questioning Consistency | MultiArith (unseen) | Average Success Count | 18.33 | 12 |
| Mathematical Reasoning | MultiArith | Accuracy | 43.33 | 10 |
| Mathematical Reasoning | MultiArith | Accuracy (Clean) | 99.44 | 8 |
| Mathematical Reasoning | MultiArith (val) | Tokens at Best Step (K) | 1,640 | 7 |
| Mathematical Reasoning | MultiArith OOD | Base Accuracy (CA) | 100 | 2 |
| Arithmetic Reasoning | MultiArith | Accuracy (format-specific prompt) | 78.7 | 2 |