
PRM800K

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Mathematical Reasoning | PRM800K (test) | Accuracy 80 | 15 |
| First-error detection | PRM800K | Accuracy 92.9 | 6 |
| Step-level hallucination detection | PRM800K | AUROC 99.8 | 6 |
| Stepwise Confidence Attribution | PRM800K | AUROC 0.8181 | 5 |
| Math Reasoning | PRM800K | AUC-ROC 0.613 | 5 |
| Instance-level Evaluation | PRM800K | AUC-ROC 0.42 | 1 |