Step-level reasoning verification on Trips
Best PR-AUC: 82.5 (Qwen2.5-Math-7B-PRM800k, Nov 9, 2025)
Evaluation Results
Method                                Details                    Date     PR-AUC
Qwen2.5-Math-7B-PRM800k               # Sample=265K              2025.11  82.5
Qwen2.5-Math-7B                       # Sample=860K              2025.11  79.1
ReProbe, Attn+Logit, Qwen3-8B-anno    # Sample=32K               2025.11  75.6
ReProbe, Hidden States, GPT-OSS-anno  # Sample=10.8K, Verify...  2025.11  75.2
Math-Shepherd-PRM-7B                  # Sample=440K              2025.11  74.7
Universal-PRM-Qwen2.5-Math-7B         # Sample=690K              2025.11  74.1
Universal-PRM-Qwen2.5-Math-7B         # Sample=690K, Verifyi...  2025.11  73.7
Qwen2.5-Math-PRM-7B                   # Sample=860K, Verifyi...  2025.11  69.6
MaxProb                               # Sample=-                 2025.11  61.8
MaxEntropy                            # Sample=-                 2025.11  58.5
Qwen2.5-Math-7B-PRM800K               # Sample=263K, Verifyi...  2025.11  57.3
RLHFlow-PRM-Deepseek-8B               # Sample=253K              2025.11  55.8
Perplexity                            # Sample=-                 2025.11  55.7
Random                                # Sample=-                 2025.11  55.2
H4-Qwen2.5-PRM-1.5B-0.2               # Sample=369K              2025.11  53.4
H4-Qwen2.5-PRM-1.5B-0.2               # Sample=369K, Verifyi...  2025.11  47
RLHFlow-PRM-Mistral-8B                # Sample=273K              2025.11  46.2
Skywork-PRM-1.5B                      # Sample=Unk, Verifyin...  2025.11  42
Skywork-PRM-1.5B                      # Sample=Unk               2025.11  40.8