
PROCESSBENCH

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Mathematical Reasoning Process Evaluation | PROCESSBENCH | GSM8K Accuracy 82.9 | 28 |
| Process-level Error Localization | PROCESSBENCH | GSM8K Accuracy 61 | 20 |
| Reasoning | ProcessBench | Accuracy 69.85 | 20 |
| Process Verification | ProcessBench Without Standard Answers | Precise Accuracy 71.9 | 18 |
| Process Verification | ProcessBench With Standard Answers | Precise Accuracy 78.9 | 18 |
| Process Reward Model Assessment | PROCESSBENCH | GSM8K Accuracy 86.6 | 15 |
| Process-level verification | ProcessBench Aggregate (test) | Avg F1 56.5 | 13 |
| Step-level Correctness Discrimination | ProcessBench GSM8K MATH Olympiad Bench Omni Math | GSM8K Error Rate 0.242 | 12 |
| Mathematical Reasoning | ProcessBench (OlympiaBench) 1.0 (test) | Accuracy 79.8 | 10 |
| Mathematical Reasoning | ProcessBench MATH 1.0 (test) | Accuracy 88.4 | 10 |
| Mathematical Reasoning | ProcessBench GSM8K 1.0 (test) | Accuracy 96 | 10 |
| Correctness Assessment | ProcessBench (test) | Worst-case Size Distortion (QwenPRM) 0.24 | 9 |
| Process-level Evaluation | ProcessBench Average | Mean F1 36.8 | 7 |
| Process-level Evaluation | ProcessBench Omni | F1 Score 25.6 | 7 |
| Process-level Evaluation | ProcessBench Olympiad | F1 Score 28.7 | 7 |
| Process-level Evaluation | ProcessBench Math | F1 Score 43.8 | 7 |
| Process-level Evaluation | ProcessBench GSM8K | F1 Score 52 | 7 |
| Step-level classification | ProcessBench (test) | F1 Score 75.1 | 6 |
| Process-level Reward Modeling | PROCESSBENCH Omni-MATH | Error Rate 2.8 | 6 |
| Process-level Reward Modeling | PROCESSBENCH Olymp.Bench | Error 3.3 | 6 |
| Process-level Reward Modeling | PROCESSBENCH MATH | Error Rate 6.1 | 6 |
| Mathematical Reasoning | ProcessBench (PB) | AUC 0.766 | 4 |