| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Mathematical Reasoning | PRM800K (test) | Accuracy 80 | 15 |
| First-error detection | PRM800K | Accuracy 92.9 | 6 |
| Step-level hallucination detection | PRM800K | AUROC 99.8 | 6 |
| Stepwise Confidence Attribution | PRM800K | AUROC 0.8181 | 5 |
| Math Reasoning | PRM800K | AUC-ROC 0.613 | 5 |
| Instance-level Evaluation | PRM800K | AUC-ROC 0.42 | 1 |