Step-level Reasoning Verification on StrQA
[Chart: PR-AUC by date. Top entry: Skywork-PRM-1.5B, PR-AUC 52.7, Nov 9, 2025.]
Evaluation Results
| Method | Details | Date | PR-AUC |
|---|---|---|---|
| Skywork-PRM-1.5B | # Sample=Unk, Verifyin... | 2025.11 | 52.7 |
| Qwen2.5-Math-7B-PRM800k | # Sample=265K | 2025.11 | 35.5 |
| ReProbe, Attn+Logit, Qwen3-8B-anno | # Sample=32K | 2025.11 | 34.7 |
| Qwen2.5-Math-7B | # Sample=860K | 2025.11 | 33.3 |
| Universal-PRM-Qwen2.5-Math-7B | # Sample=690K | 2025.11 | 32.1 |
| RLHFlow-PRM-Deepseek-8B | # Sample=253K | 2025.11 | 31.5 |
| Qwen2.5-Math-PRM-7B | # Sample=860K, Verifyi... | 2025.11 | 31.3 |
| ReProbe, Hidden States, GPT-OSS-anno | # Sample=10.8K, Verify... | 2025.11 | 30.2 |
| Qwen2.5-Math-7B-PRM800K | # Sample=263K, Verifyi... | 2025.11 | 28.2 |
| RLHFlow-PRM-Mistral-8B | # Sample=273K | 2025.11 | 26.1 |
| MaxProb | # Sample=- | 2025.11 | 25.2 |
| Math-Shepherd-PRM-7B | # Sample=440K | 2025.11 | 24.9 |
| MaxEntropy | # Sample=- | 2025.11 | 24.8 |
| Skywork-PRM-1.5B | # Sample=Unk | 2025.11 | 23.7 |
| Perplexity | # Sample=- | 2025.11 | 22.8 |
| H4-Qwen2.5-PRM-1.5B-0.2 | # Sample=369K | 2025.11 | 21.2 |
| Universal-PRM-Qwen2.5-Math-7B | # Sample=690K, Verifyi... | 2025.11 | 18.9 |
| Random | # Sample=- | 2025.11 | 17.2 |
| H4-Qwen2.5-PRM-1.5B-0.2 | # Sample=369K, Verifyi... | 2025.11 | 12.9 |
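
For readers unfamiliar with the metric, here is a minimal sketch of how a PR-AUC score can be computed for step-level verification. It uses average precision, one common estimator of the area under the precision-recall curve. This page does not say which estimator, which positive class, or which score direction the benchmark uses, so the function name `pr_auc`, the label convention (1 = positive class), and the toy data are all assumptions made for illustration. The reported values appear to be percentages, so the sketch's output would be multiplied by 100 to compare.

```python
from typing import Sequence


def pr_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, a step-wise estimate of the area under the PR curve.

    labels: 1 for the positive class, 0 otherwise (assumed convention).
    scores: verifier outputs, where higher means "more likely positive".
    Tied scores are kept in input order; no special tie handling is done.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank reasoning steps from highest to lowest verifier score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    true_pos = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_pos += 1
            # Each positive raises recall by 1/total_pos; weight that
            # recall increment by the precision at this rank.
            area += (true_pos / rank) / total_pos
    return area


# Toy data (hypothetical): four reasoning steps, two of them positive.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_result = pr_auc(toy_labels, toy_scores)
```

A scorer that ranks steps at random is expected to score close to the fraction of positive steps, which is one way to read the Random row as a floor for the other methods.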