Step-level reasoning evaluation on Rooms (test)
[Chart: Error Rate over time; highlighted method: Qwen2.5-Math-7B-PRM800k (Apr 20, 2026). Available metrics: Error Rate, Accuracy, F1 Score.]
Updated 1mo ago
Evaluation Results
| Method | Model | Date | Error Rate | Accuracy | F1 Score |
|---|---|---|---|---|---|
| Qwen2.5-Math-7B-PRM800k | Model=Qwen2.5-Math-7B-... | 2026.04 | 0 | 100 | 0 |
| Qwen2.5-Math-PRM-7B | Model=Qwen2.5-Math-PRM-7B | 2026.04 | 9.5 | 100 | 17.3 |
| Skywork-PRM-7B | Model=Skywork-PRM-7B | 2026.04 | 14.7 | 0 | 0 |
| Qwen2.5-Math-7B-MathShepherd-r | Model=Qwen2.5-Math-7B-... | 2026.04 | 14.7 | 0 | 0 |
| Qwen2.5-Math-7B-PRM800k-r | Model=Qwen2.5-Math-7B-... | 2026.04 | 14.8 | 100 | 25.7 |
| Llama-3.1-8B-PRM800k-r | Model=Llama-3.1-8B-PRM... | 2026.04 | 46.9 | 99.5 | 63.7 |
| Qwen2.5-Math-7B-PRM800k-PDDL-r | Model=Qwen2.5-Math-7B-... | 2026.04 | 75.4 | 86.8 | 80.7 |
| Llama-3.1-8B-PRM800k-PDDL-r | Model=Llama-3.1-8B-PRM... | 2026.04 | 75.7 | 82.9 | 79.1 |
| Qwen2.5-Math-7B-MathShepherd-PDDL-r | Model=Qwen2.5-Math-7B-... | 2026.04 | 81.3 | 99.9 | 89.6 |
| Qwen2.5-Math-PRM-7B-PDDL-r | Model=Qwen2.5-Math-PRM... | 2026.04 | 83.1 | 98.9 | 90.3 |
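
The page does not say how the F1 Score column is computed. In every row, however, the reported value is within rounding distance of the harmonic mean of the Error Rate and Accuracy columns. The sketch below checks that observation against the values listed above. The function names `harmonic_mean` and `max_deviation` are illustrative, and the harmonic-mean relationship is an assumption inferred from the numbers, not a definition taken from the benchmark.

```python
# Rows copied from the leaderboard above: (method, error_rate, accuracy, f1).
ROWS = [
    ("Qwen2.5-Math-7B-PRM800k", 0.0, 100.0, 0.0),
    ("Qwen2.5-Math-PRM-7B", 9.5, 100.0, 17.3),
    ("Skywork-PRM-7B", 14.7, 0.0, 0.0),
    ("Qwen2.5-Math-7B-MathShepherd-r", 14.7, 0.0, 0.0),
    ("Qwen2.5-Math-7B-PRM800k-r", 14.8, 100.0, 25.7),
    ("Llama-3.1-8B-PRM800k-r", 46.9, 99.5, 63.7),
    ("Qwen2.5-Math-7B-PRM800k-PDDL-r", 75.4, 86.8, 80.7),
    ("Llama-3.1-8B-PRM800k-PDDL-r", 75.7, 82.9, 79.1),
    ("Qwen2.5-Math-7B-MathShepherd-PDDL-r", 81.3, 99.9, 89.6),
    ("Qwen2.5-Math-PRM-7B-PDDL-r", 83.1, 98.9, 90.3),
]


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two percentages; defined as 0 when both are 0."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def max_deviation(rows) -> float:
    """Largest gap between the reported F1 and the recomputed harmonic mean."""
    return max(abs(harmonic_mean(err, acc) - f1) for _, err, acc, f1 in rows)


if __name__ == "__main__":
    # Assumption: F1 = harmonic mean of the other two columns (inferred, not stated).
    for name, err, acc, f1 in ROWS:
        print(f"{name}: reported={f1}, recomputed={harmonic_mean(err, acc):.2f}")
    print(f"max deviation: {max_deviation(ROWS):.3f}")
```

Under this reading, a method scoring 0 on either column gets an F1 of 0 regardless of the other column, which matches the first, third, and fourth rows.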