Towards Robust Process Reward Modeling via Noise-aware Learning

About

Process Reward Models (PRMs) have achieved strong results in complex reasoning, but are bottlenecked by costly process-level supervision. A widely used alternative, Monte Carlo Estimation (MCE), defines the process reward of a reasoning step as the probability that a policy model reaches the correct final answer from that step. However, step correctness is an intrinsic property of the reasoning trajectory and should be invariant to the choice of policy. Our empirical findings show that MCE produces policy-dependent rewards that induce label noise, including false positives that reward incorrect steps and false negatives that penalize correct ones. To address these challenges, we propose a two-stage framework to mitigate noisy supervision. In the labeling stage, we introduce a reflection-aware label correction mechanism that uses a large language model (LLM) as a judge to detect reflection and self-correction behaviors related to the current reasoning step, thereby suppressing overestimated rewards. In the training stage, we further propose a Noise-Aware Iterative Training (NAIT) framework that enables the PRM to progressively refine noisy labels based on its own confidence. Extensive experiments show that our method substantially improves step-level correctness discrimination, achieving up to a 27% absolute gain in average F1 over PRMs trained with noisy supervision.
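The MCE definition above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `mce_step_reward`, `make_toy_policy`, and `is_correct`, the toy answer "42", and the rollout count are all hypothetical, and the "policies" are stand-in functions rather than language models. The sketch only shows why the same reasoning prefix can receive different rewards under different policies.

```python
import random
from typing import Callable, List


def mce_step_reward(
    prefix: List[str],
    rollout: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    num_rollouts: int = 8,
) -> float:
    """Estimate a step reward as the fraction of rollouts from `prefix`
    that end in a correct final answer (Monte Carlo estimate)."""
    hits = sum(is_correct(rollout(prefix)) for _ in range(num_rollouts))
    return hits / num_rollouts


def make_toy_policy(success_prob: float, seed: int) -> Callable[[List[str]], str]:
    """Hypothetical stand-in for a policy model: it ignores the prefix and
    returns the correct answer "42" with probability `success_prob`."""
    rng = random.Random(seed)

    def rollout(prefix: List[str]) -> str:
        return "42" if rng.random() < success_prob else "0"

    return rollout


def is_correct(answer: str) -> bool:
    # Assumed toy checker: the reference answer is "42".
    return answer == "42"


# The same reasoning prefix is scored under two different toy policies.
prefix = ["step 1", "step 2"]
reward_strong = mce_step_reward(prefix, make_toy_policy(1.0, seed=0), is_correct)
reward_weak = mce_step_reward(prefix, make_toy_policy(0.0, seed=0), is_correct)
```

Here the identical prefix is labeled 1.0 under the always-successful toy policy and 0.0 under the always-failing one, which is the kind of policy dependence the abstract describes as a source of false positives and false negatives.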

Bin Xie, Bingbing Xu, Xueyun Tian, Yilin Chen, Huawei Shen • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH	Accuracy 82.1	535
Mathematical Reasoning	Minerva Math	Accuracy 52.7	228
Mathematical Reasoning	Gaokao	Accuracy 73.2	51
Step-level Correctness Discrimination	ProcessBench GSM8K MATH Olympiad Bench Omni Math	GSM8K Error Rate 0.332	12
