Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

About

Reinforcement Learning with Verifiable Rewards (RLVR) improves final-answer accuracy on reasoning tasks, but it does not reliably improve reasoning quality. Because outcome rewards only assess final answers, they also reward spurious successes: flawed reasoning can still receive maximal reward when it accidentally reaches the correct outcome. This outcome reward hacking creates biased gradients, making current RLVR insufficient for learning faithful reasoning. Process Reward Models (PRMs) provide step-wise supervision, but directly optimizing PRMs or naively combining them with outcome rewards is unstable under distribution shift during RL training. We introduce the PRocess cOnsistency Filter (PROF), a data curation method that uses PRM-ORM consistency for sample selection rather than direct reward optimization. PROF keeps correct responses with strong process support and incorrect responses with weak process support while maintaining a balanced training ratio. Experiments show that PROF consistently improves both final-answer accuracy and intermediate reasoning quality over strong baselines, with less dependence on strong PRMs.
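
The selection rule described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Rollout` type, the single aggregated `process_score` per response, the `keep_per_group` parameter, and the way balance is enforced (keeping the same number of correct and incorrect responses) are all assumptions made for illustration, since the abstract does not specify these details.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled response (hypothetical container, not from the paper)."""
    text: str
    is_correct: bool      # outcome reward: did the final answer verify?
    process_score: float  # assumed: one aggregated PRM score per response


def consistency_filter(rollouts: List[Rollout], keep_per_group: int) -> List[Rollout]:
    """Keep responses whose process score agrees with their outcome.

    Correct responses are ranked by highest process score, incorrect ones
    by lowest. Assumption: balance is enforced by keeping the same number
    of responses from each group.
    """
    correct = sorted(
        (r for r in rollouts if r.is_correct),
        key=lambda r: r.process_score,
        reverse=True,  # strongest process support first
    )
    incorrect = sorted(
        (r for r in rollouts if not r.is_correct),
        key=lambda r: r.process_score,  # weakest process support first
    )
    # Same count from both groups; this is zero if either group is empty.
    k = min(keep_per_group, len(correct), len(incorrect))
    return correct[:k] + incorrect[:k]


rollouts = [
    Rollout("a", True, 0.9),   # correct, well-supported reasoning
    Rollout("b", True, 0.2),   # correct, but likely a spurious success
    Rollout("c", False, 0.8),  # incorrect despite a high process score
    Rollout("d", False, 0.1),  # incorrect, weak reasoning
    Rollout("e", True, 0.6),
]
print([r.text for r in consistency_filter(rollouts, keep_per_group=1)])
# prints ['a', 'd']
```

In this toy example the filter drops `b`, a correct answer with weak process support, which is the kind of spurious success the abstract says outcome rewards would otherwise reinforce. Note that the PRM score is used only to select samples here and never enters the reward itself.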

Chenlu Ye, Zhou Yu, Ziji Zhang, Hao Chen, Narayanan Sadagopan, Jing Huang, Tong Zhang, Anurag Beniwal • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | AIME 2024 | Accuracy @16: 24.8 | 81 |
| Mathematical Reasoning | HMMT Feb 2025 | -- | 45 |
| Mathematical Reasoning | Minerva | Pass@1 (Avg@16): 41.7 | 32 |
| Mathematical Reasoning | AMC23 | Avg@16: 69.1 | 29 |
| Mathematical Reasoning | MATH500 | Average Accuracy @16: 83.1 | 15 |
| Mathematical Reasoning | Minerva Math | Accuracy (Avg@16): 39 | 15 |
| Mathematical Reasoning | Olympiad Bench | Average@16 Accuracy: 47.8 | 15 |
| Mathematical Reasoning | AMC 2023 | Average@16 Accuracy: 70.9 | 15 |
| Mathematical Reasoning | Hmmt feb-2024 | Average@16: 13.7 | 15 |
| Mathematical Reasoning | AIME 2025 | Average@16: 19.6 | 15 |
Showing 10 of 13 rows
