Step-level Reasoning Verification on GSM8k
[Chart: PR-AUC over time. Best result: Qwen2.5-Math-PRM-7B, PR-AUC 93, Nov 9, 2025.]
Evaluation Results
| Method | Configuration | Date | PR-AUC |
|---|---|---|---|
| Qwen2.5-Math-PRM-7B | # Sample=860K, Verifyi... | 2025.11 | 93 |
| Universal-PRM-Qwen2.5-Math-7B | # Sample=690K, Verifyi... | 2025.11 | 90.3 |
| Qwen2.5-Math-7B-PRM800K | # Sample=263K, Verifyi... | 2025.11 | 89.4 |
| Skywork-PRM-1.5B | # Sample=Unk, Verifyin... | 2025.11 | 88.6 |
| ReProbe, Hidden States, GPT-OSS-anno | # Sample=10.8K, Verify... | 2025.11 | 53.5 |
| H4-Qwen2.5-PRM-1.5B-0.2 | # Sample=369K, Verifyi... | 2025.11 | 46.4 |
| Qwen2.5-Math-7B-PRM800k | # Sample=265K | 2025.11 | 40.6 |
| Qwen2.5-Math-7B | # Sample=860K | 2025.11 | 37.7 |
| ReProbe, Attn+Logit, Qwen3-8B-anno | # Sample=32K | 2025.11 | 34 |
| RLHFlow-PRM-Deepseek-8B | # Sample=253K | 2025.11 | 26.3 |
| Universal-PRM-Qwen2.5-Math-7B | # Sample=690K | 2025.11 | 21.3 |
| RLHFlow-PRM-Mistral-8B | # Sample=273K | 2025.11 | 19.5 |
| Math-Shepherd-PRM-7B | # Sample=440K | 2025.11 | 18.8 |
| Skywork-PRM-1.5B | # Sample=Unk | 2025.11 | 18.1 |
| MaxProb | # Sample=- | 2025.11 | 8.4 |
| MaxEntropy | # Sample=- | 2025.11 | 7.9 |
| Perplexity | # Sample=- | 2025.11 | 6.6 |
| H4-Qwen2.5-PRM-1.5B-0.2 | # Sample=369K | 2025.11 | 6.1 |
| Random | # Sample=- | 2025.11 | 3.8 |
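
For readers unfamiliar with the metric, here is a minimal sketch of how a PR-AUC score for step-level verification could be computed. It uses the average-precision estimator, which is one common way to estimate the area under the precision-recall curve; the page does not state which estimator, which positive class, or which scale the leaderboard uses, so all of those are assumptions here. The function name `average_precision` and the toy labels and scores are hypothetical.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Estimate PR-AUC as average precision.

    labels: 1 for the class being detected (assumed here to be the
            positive class, e.g. a flagged reasoning step), 0 otherwise.
    scores: verifier scores, higher meaning "more likely positive".
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank steps from highest to lowest score. Ties are broken by input
    # order, which is a simplification compared with library versions
    # that group tied scores into a single threshold.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            # Precision at the rank where recall increases.
            total += true_positives / rank
    return total / n_pos


# Toy example: four reasoning steps, two of them positive.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)
```

Under this estimator, a scorer that ranks steps at random has an expected value close to the fraction of positive steps, which is why a random baseline is a useful floor when reading a table like the one above. The leaderboard values appear to be on a 0 to 100 scale, so a result from this sketch would need to be multiplied by 100 to be comparable, assuming the same estimator is used.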