More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
About
We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a novel training framework for process reward modeling that segments complex reasoning into steps dynamically, in line with model uncertainty, and so eliminates the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs), which rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, capturing intrinsic logical transitions and enabling efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines such as Math-Shepherd PRM and Omega PRM, and achieves results comparable to state-of-the-art (SOTA) models while using only 1.5% of the training data. Furthermore, with our proposed EDU sampling strategy, accuracy on generative reasoning tasks rises from 64.7% to 67.3% while token usage falls by 32%. These findings underscore the potential of EDU-PRM as a scalable, annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
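To make the idea of anchoring step boundaries at high-entropy tokens concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `token_entropy` and `segment_by_entropy`, the fixed `threshold`, and the toy next-token distributions are all hypothetical, and the actual EDU-PRM procedure may select or use boundaries differently.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def segment_by_entropy(tokens, distributions, threshold):
    """Split tokens into steps, opening a new step at each high-entropy token.

    `distributions[i]` is the (assumed) predictive distribution from which
    `tokens[i]` was generated. `threshold` is a hypothetical hyperparameter.
    """
    steps, current = [], []
    for token, probs in zip(tokens, distributions):
        # A token whose predictive entropy exceeds the threshold starts a new
        # step, unless it is the very first token of the sequence.
        if current and token_entropy(probs) > threshold:
            steps.append(current)
            current = []
        current.append(token)
    if current:
        steps.append(current)
    return steps


# Toy example: the model is confident everywhere except at the token "so".
confident = [0.97, 0.02, 0.01]  # entropy is roughly 0.15 nats
uncertain = [0.40, 0.30, 0.30]  # entropy is roughly 1.09 nats
tokens = ["x", "=", "2", "so", "y", "=", "4"]
distributions = [confident, confident, confident, uncertain,
                 confident, confident, confident]

steps = segment_by_entropy(tokens, distributions, threshold=1.0)
print(steps)  # [['x', '=', '2'], ['so', 'y', '=', '4']]
```

In this sketch the only boundary falls at the token where the model is least sure what comes next, which is the intuition behind treating high-entropy positions as logical transitions. A real system would read the distributions from the language model's logits rather than from hand-written lists.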
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 92.1 | 1362 |
| Mathematical Reasoning | MATH | Accuracy | 65.5 | 882 |
| Mathematical Reasoning | CollegeMATH | Accuracy | 15.5 | 276 |
| Mathematical Reasoning | OLY | Accuracy | 32.7 | 91 |
| Mathematical Reasoning | ProcessBench MATH 1.0 (test) | Accuracy | 88.4 | 10 |
| Mathematical Reasoning | ProcessBench GSM8K 1.0 (test) | Accuracy | 94.2 | 10 |
| Mathematical Reasoning | ProcessBench (OlympiadBench) 1.0 (test) | Accuracy | 77.2 | 10 |