
Towards Robust Process Reward Modeling via Noise-aware Learning

About

Process Reward Models (PRMs) have achieved strong results in complex reasoning but are bottlenecked by costly process-level supervision. A widely used alternative, Monte Carlo Estimation (MCE), defines the process reward of a reasoning step as the probability that a policy model reaches the correct final answer from that step. However, step correctness is an intrinsic property of the reasoning trajectory and should be invariant to the choice of policy. Our empirical findings show that MCE produces policy-dependent rewards that induce label noise, including false positives that reward incorrect steps and false negatives that penalize correct ones. To address these challenges, we propose a two-stage framework that mitigates noisy supervision. In the labeling stage, we introduce a reflection-aware label correction mechanism that uses a large language model (LLM) as a judge to detect reflection and self-correction behaviors related to the current reasoning step, thereby suppressing overestimated rewards. In the training stage, we further propose a Noise-Aware Iterative Training (NAIT) framework that enables the PRM to progressively refine noisy labels based on its own confidence. Extensive experiments show that our method substantially improves step-level correctness discrimination, achieving up to a 27% absolute gain in average F1 over PRMs trained with noisy supervision.
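To make the MCE labeling scheme concrete, below is a minimal sketch of rollout-based step rewards under stated assumptions; it is not the paper's implementation. The `policy_rollout` callable and all parameter names (`n_rollouts`, `threshold`) are hypothetical placeholders.

```python
def mce_step_reward(question, prefix_steps, correct_answer,
                    policy_rollout, n_rollouts=8):
    """Estimate a process reward via Monte Carlo Estimation (MCE).

    The reward of the step that ends `prefix_steps` is the fraction of
    policy rollouts that continue from this partial trajectory and
    reach the correct final answer.

    `policy_rollout(question, prefix_steps) -> str` is a hypothetical
    stand-in for sampling one completion from the policy model and
    extracting its final answer.
    """
    hits = sum(
        policy_rollout(question, prefix_steps) == correct_answer
        for _ in range(n_rollouts)
    )
    return hits / n_rollouts


def mce_labels(question, steps, correct_answer, policy_rollout,
               n_rollouts=8, threshold=0.0):
    """Label every step of a trajectory with its MCE reward and a
    binary correctness label (reward > threshold), i.e., the kind of
    supervision signal a PRM would be trained on."""
    labels = []
    for i in range(1, len(steps) + 1):
        reward = mce_step_reward(question, steps[:i], correct_answer,
                                 policy_rollout, n_rollouts)
        labels.append((reward, reward > threshold))
    return labels
```

Because the estimate depends on the rollout policy, a strong policy can recover from an incorrect step (a false positive), while a weak policy can fail after a correct one (a false negative); this policy-dependent label noise is what the proposed two-stage framework targets.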

Bin Xie, Bingbing Xu, Xueyun Tian, Yilin Chen, Huawei Shen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH | Accuracy: 82.1 | 535 |
| Mathematical Reasoning | Minerva Math | Accuracy: 52.7 | 100 |
| Mathematical Reasoning | Gaokao | Accuracy: 73.2 | 51 |
| Step-level Correctness Discrimination | ProcessBench (GSM8K, MATH, OlympiadBench, Omni-MATH) | GSM8K Error Rate: 0.332 | 12 |
