PROCESSBENCH

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Process-level Error Localization | PROCESSBENCH | GSM8K Accuracy: 88 | 44 |
| Mathematical Reasoning Process Evaluation | ProcessBench (test) | GSM8K Accuracy: 96.9 | 35 |
| Mathematical Reasoning Process Evaluation | PROCESSBENCH | GSM8K Accuracy: 82.9 | 28 |
| Reasoning | ProcessBench | Accuracy: 69.85 | 20 |
| Process Reward Model Assessment | PROCESSBENCH | GSM8K Accuracy: 87.3 | 20 |
| Process Verification | ProcessBench Without Standard Answers | Precise Accuracy: 71.9 | 18 |
| Process Verification | ProcessBench With Standard Answers | Precise Accuracy: 78.9 | 18 |
| Process Reward Modeling | ProcessBench 1.0 (test) | GSM8K Score: 87.3 | 14 |
| Step-wise Verification | ProcessBench Overall | F1 Score: 72.3 | 13 |
| Step-wise Verification | ProcessBench Omni-MATH | TNR: 63.1 | 13 |
| Step-wise Verification | ProcessBench OlympiadBench | TNR: 61.3 | 13 |
| Step-wise Verification | ProcessBench Math | TNR: 69.5 | 13 |
| Step-wise Verification | ProcessBench GSM8K v1 (val) | True Negative Rate: 68.6 | 13 |
| Process-level verification | ProcessBench Aggregate (test) | Avg F1: 56.5 | 13 |
| Step-level Correctness Discrimination | ProcessBench GSM8K MATH Olympiad Bench Omni Math | GSM8K Error Rate: 0.242 | 12 |
| Faithfulness detection | ProcessBench | F1 Score: 83.2 | 10 |
| Mathematical Reasoning | ProcessBench (OlympiaBench) 1.0 (test) | Accuracy: 79.8 | 10 |
| Mathematical Reasoning | ProcessBench MATH 1.0 (test) | Accuracy: 88.4 | 10 |
| Mathematical Reasoning | ProcessBench GSM8K 1.0 (test) | Accuracy: 96 | 10 |
| Correctness Assessment | ProcessBench (test) | Worst-case Size Distortion (QwenPRM): 0.24 | 9 |
| Process-level Evaluation | ProcessBench Average | Mean F1: 36.8 | 7 |
| Process-level Evaluation | ProcessBench Omni | F1 Score: 25.6 | 7 |
| Process-level Evaluation | ProcessBench Olympiad | F1 Score: 28.7 | 7 |
| Process-level Evaluation | ProcessBench Math | F1 Score: 43.8 | 7 |
| Process-level Evaluation | ProcessBench GSM8K | F1 Score: 52 | 7 |

Showing 25 of 32 rows