Step-level reasoning verification on SciQA
Top result: RLHFlow-PRM-Deepseek-8B, PR-AUC 44 (Nov 9, 2025). Updated 1mo ago.
[Chart: PR-AUC (y-axis) by date]
Evaluation Results
| Method | Details | Date | PR-AUC |
|---|---|---|---|
| RLHFlow-PRM-Deepseek-8B | # Sample=253K | 2025.11 | 44 |
| Skywork-PRM-1.5B | # Sample=Unk | 2025.11 | 41.5 |
| Qwen2.5-Math-7B-PRM800K | # Sample=263K, Verifyi... | 2025.11 | 35 |
| ReProbe, Attn+Logit, Qwen3-8B-anno | # Sample=32K | 2025.11 | 34.7 |
| Qwen2.5-Math-7B-PRM800k | # Sample=265K | 2025.11 | 32.9 |
| Math-Shepherd-PRM-7B | # Sample=440K | 2025.11 | 32.7 |
| Qwen2.5-Math-PRM-7B | # Sample=860K, Verifyi... | 2025.11 | 31.5 |
| RLHFlow-PRM-Mistral-8B | # Sample=273K | 2025.11 | 31.1 |
| Qwen2.5-Math-7B | # Sample=860K | 2025.11 | 31 |
| Universal-PRM-Qwen2.5-Math-7B | # Sample=690K, Verifyi... | 2025.11 | 30.3 |
| ReProbe, Hidden States, GPT-OSS-anno | # Sample=10.8K, Verify... | 2025.11 | 30.3 |
| Skywork-PRM-1.5B | # Sample=Unk, Verifyin... | 2025.11 | 26.5 |
| Universal-PRM-Qwen2.5-Math-7B | # Sample=690K | 2025.11 | 25.2 |
| H4-Qwen2.5-PRM-1.5B-0.2 | # Sample=369K, Verifyi... | 2025.11 | 22.1 |
| MaxProb | # Sample=- | 2025.11 | 15.8 |
| Perplexity | # Sample=- | 2025.11 | 14.3 |
| MaxEntropy | # Sample=- | 2025.11 | 13.5 |
| H4-Qwen2.5-PRM-1.5B-0.2 | # Sample=369K | 2025.11 | 11.6 |
| Random | # Sample=- | 2025.11 | 8.6 |
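
For readers unfamiliar with the metric, the following is a minimal sketch of how a PR-AUC score of this kind could be computed from per-step labels and verifier scores. It is not the benchmark's evaluation code. It assumes that the positive class is an incorrect reasoning step, that a higher score means "more likely incorrect", and that the reported numbers are on a 0-100 scale; none of this is stated on the page. The function name `pr_auc` and the toy data are hypothetical. The sum used here is the step-wise average-precision estimate of the area under the precision-recall curve, the same definition documented for scikit-learn's `average_precision_score`.

```python
from itertools import groupby


def pr_auc(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 for a positive step (assumed here: an incorrect step), else 0.
    scores: verifier scores, higher = more likely positive (assumption).
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("PR-AUC is undefined without positive examples")

    # Rank steps from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    true_pos = 0   # positives found so far
    seen = 0       # steps ranked so far
    area = 0.0
    # Tied scores share one threshold, so they are processed as one group.
    for _, group in groupby(ranked, key=lambda pair: pair[0]):
        group = list(group)
        new_pos = sum(label for _, label in group)
        true_pos += new_pos
        seen += len(group)
        precision = true_pos / seen
        recall_gain = new_pos / n_pos
        area += recall_gain * precision
    return area


# Toy data (hypothetical): 8 steps, 2 of them incorrect.
labels = [0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.1, 0.9, 0.8, 0.2, 0.3, 0.05, 0.4, 0.15]
print(round(100 * pr_auc(labels, scores), 1))  # 75.0
```

A scorer that gives every step the same score gets a PR-AUC equal to the fraction of positive steps (0.25 on the toy data above), which is why a chance-level baseline on an imbalanced dataset scores far below 50.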