Step-level reasoning evaluation

Benchmarks

Dataset Name	SOTA Method	Metric	Value		Updated
PDDL (test)	Llama-3.1-8B-PRM800k-PDDL-r	F1 Score	94.5	20	3mo ago
Rooms (test)	Qwen2.5-Math-7B-PRM800k	Error Rate	0	10	3mo ago
PRMBench step-level 2025	PRISM	FPR	47.13	5	1mo ago
