Efficient Process Reward Modeling via Contrastive Mutual Information

About

Recent research has devoted considerable effort to verifying the intermediate reasoning steps of chain-of-thought (CoT) trajectories using process reward models (PRMs) and other verifier models. However, training a PRM typically requires human annotators to assign a reward score to each reasoning step, which is both costly and time-consuming. Existing automated approaches, such as Monte Carlo (MC) estimation, also demand substantial computational resources because they rely on repeated LLM rollouts. To overcome these limitations, we propose contrastive pointwise mutual information (CPMI), a novel automatic reward labeling method that uses the model's internal probabilities to infer step-level supervision while significantly reducing the computational burden of annotating a dataset. CPMI quantifies how much a reasoning step increases the mutual information between the step and the correct target answer relative to hard-negative alternatives. This contrastive signal serves as a proxy for the step's contribution to the final solution and yields a reliable reward. Experimental results show that CPMI-based labeling reduces dataset construction time by 84% and token generation by 98% compared to MC estimation, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks.
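To make the contrastive idea concrete, the following is a minimal sketch and not the paper's exact formulation, which the abstract does not spell out. It assumes a step's pointwise mutual information gain for an answer is the change in the answer's log-probability when the step is appended to the context, and that the contrastive score subtracts the mean gain over hard-negative answers. The function names `pmi_gain` and `contrastive_pmi`, and the choice of a mean over negatives, are hypothetical; the log-probabilities would come from a language model and are passed in here as plain floats.

```python
from statistics import fmean
from typing import Sequence


def pmi_gain(logp_with_step: float, logp_without_step: float) -> float:
    """Change in an answer's log-probability once the step is in the context.

    Assumption: this difference stands in for the pointwise mutual
    information between the step and the answer, given the prior context.
    """
    return logp_with_step - logp_without_step


def contrastive_pmi(
    pos_with: float,
    pos_without: float,
    neg_with: Sequence[float],
    neg_without: Sequence[float],
) -> float:
    """Hypothetical CPMI-style score for a single reasoning step.

    pos_*: log-probabilities of the correct answer with / without the step.
    neg_*: log-probabilities of each hard-negative answer, in the same order.
    Averaging over negatives is an illustrative choice, not the paper's.
    """
    if not neg_with or len(neg_with) != len(neg_without):
        raise ValueError("need matching, non-empty negative log-probabilities")
    pos_gain = pmi_gain(pos_with, pos_without)
    neg_gain = fmean(pmi_gain(w, wo) for w, wo in zip(neg_with, neg_without))
    # Positive score: the step favors the correct answer more than the negatives.
    return pos_gain - neg_gain


# Example with made-up log-probabilities.
score = contrastive_pmi(-1.0, -2.0, [-3.0, -4.0], [-3.0, -3.0])
print(score)
```

Under these assumptions, scoring a step needs only forward passes over the correct and negative answers rather than repeated sampled rollouts, which is consistent with the reduction in token generation the abstract reports.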

Nakyung Lee, Sangwoo Hong, Jungwoo Lee • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Logical reasoning | FOLIO | Accuracy | 59.61 | 123 |
| Logical reasoning | LogiQA-2 | Accuracy | 37.71 | 34 |
| Logical reasoning | LogicNLI | Accuracy | 32.1 | 11 |
| Process-level Evaluation | ProcessBench GSM8K | F1 Score | 52 | 7 |
| Process-level Evaluation | ProcessBench Olympiad | F1 Score | 28.7 | 7 |
| Process-level Evaluation | ProcessBench Omni | F1 Score | 25.6 | 7 |
| Process-level Evaluation | ProcessBench Average | Mean F1 | 36.8 | 7 |
| Process-level Evaluation | PROCESSBENCH MATH | F1 Score | 40.8 | 7 |
| Step-level quality assessment | PRMBENCH | Simplicity | 54.65 | 5 |
| Computation Cost Analysis | Reasoning Samples 10K | Time Ratio | 0.16 | 4 |
Showing 10 of 13 rows