| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Mathematical Reasoning | GSM-Hard | Accuracy | 99 | 169 |
| Mathematical Reasoning | GSM-Hard | Solve Rate | 78 | 162 |
| Reasoning | GSM PRO | Accuracy | 100 | 72 |
| Math Reasoning | GSM Hard | Accuracy | 82.6 | 67 |
| Mathematical Reasoning | GSM-Hard | Accuracy | 89.52 | 46 |
| Reasoning | GSM→FOL | Accuracy | 85.8 | 45 |
| Mathematical Reasoning | GSM | Accuracy | 94 | 45 |
| Mathematical Reasoning | GSM (test) | Accuracy | 65.4 | 42 |
| Mathematical Reasoning | GSM Hard | Accuracy | 68.6 | 28 |
| Mathematical Reasoning | GSM-Hard | GSM-Hard pass@1 Acc | 69.6 | 27 |
| Mathematical Reasoning | GSM | Accuracy | 61 | 27 |
| Mathematical Reasoning (Calculator) | GSM-PLUS | Accuracy | 76.54 | 25 |
| Math | GSM-Plus | Score | 89.74 | 22 |
| Mathematical Reasoning | GSM-ICM | Accuracy | 92.7 | 16 |
| Math Reasoning | GSM-H (held-out) | Accuracy (%) | 57.54 | 14 |
| Mathematical Reasoning | GSM 8K | pass@K | 97.77 | 12 |
| Multi-objective reinforcement learning | RLVR-GSM | Multiplicative Gap (ε) | 0.0112 | 12 |
| Grade-school reasoning | GSM Hard | Pass@1 Success Rate | 53.4 | 9 |
| Correctness verification | GSM-Symbolic | LB | 0.435 | 8 |
| Math Reasoning | GSM DE | Accuracy | 66 | 7 |
| Math Reasoning | GSM CoT | Accuracy (GSM CoT) | 83.2 | 7 |
| Mathematical Reasoning | GSM | GSM Accuracy | 92.16 | 7 |
| Arithmetic Reasoning | GSM Reversed | Accuracy | 90.3 | 7 |
| Mathematical Reasoning | GSM-SYS | Accuracy | 80.9 | 7 |
| Compiler phase ordering | gsm | Execution Cycles | 6,178 | 7 |