| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Mathematical Reasoning Process Evaluation | PROCESSBENCH | GSM8K Accuracy | 82.9 | 28 |
| Process-level Error Localization | PROCESSBENCH | GSM8K Accuracy | 61 | 20 |
| Reasoning | ProcessBench | Accuracy | 69.85 | 20 |
| Process Verification | ProcessBench Without Standard Answers | Precise Accuracy | 71.9 | 18 |
| Process Verification | ProcessBench With Standard Answers | Precise Accuracy | 78.9 | 18 |
| Process Reward Model Assessment | PROCESSBENCH | GSM8K Accuracy | 86.6 | 15 |
| Process-level verification | ProcessBench Aggregate (test) | Avg F1 | 56.5 | 13 |
| Step-level Correctness Discrimination | ProcessBench GSM8K MATH Olympiad Bench Omni Math | GSM8K Error Rate | 0.242 | 12 |
| Mathematical Reasoning | ProcessBench (OlympiaBench) 1.0 (test) | Accuracy | 79.8 | 10 |
| Mathematical Reasoning | ProcessBench MATH 1.0 (test) | Accuracy | 88.4 | 10 |
| Mathematical Reasoning | ProcessBench GSM8K 1.0 (test) | Accuracy | 96 | 10 |
| Correctness Assessment | ProcessBench (test) | Worst-case Size Distortion (QwenPRM) | 0.24 | 9 |
| Process-level Evaluation | ProcessBench Average | Mean F1 | 36.8 | 7 |
| Process-level Evaluation | ProcessBench Omni | F1 Score | 25.6 | 7 |
| Process-level Evaluation | ProcessBench Olympiad | F1 Score | 28.7 | 7 |
| Process-level Evaluation | ProcessBench Math | F1 Score | 43.8 | 7 |
| Process-level Evaluation | ProcessBench GSM8K | F1 Score | 52 | 7 |
| Step-level classification | ProcessBench (test) | F1 Score | 75.1 | 6 |
| Process-level Reward Modeling | PROCESSBENCH Omni-MATH | Error Rate | 2.8 | 6 |
| Process-level Reward Modeling | PROCESSBENCH Olymp.Bench | Error | 3.3 | 6 |
| Process-level Reward Modeling | PROCESSBENCH MATH | Error Rate | 6.1 | 6 |
| Mathematical Reasoning | ProcessBench (PB) | AUC | 0.766 | 4 |