GRPO is Secretly a Process Reward Model

About

Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs tuned with $\lambda$-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks, and reach peak performance more rapidly. These results show that we can leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance without employing an explicit PRM, and with a negligible impact on training time and cost.
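To make the setup concrete, the following is a minimal sketch of the standard (vanilla) GRPO advantage computation that the abstract refers to: a group of completions is sampled per prompt, each receives a single outcome reward, and advantages are the group-normalized rewards. The function name `grpo_advantages` is illustrative, and the $\lambda$-GRPO modification itself is not reproduced here since its details are given in the paper, not in this abstract.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as in vanilla GRPO: each sampled
    completion's outcome reward is normalized by the mean and
    standard deviation of its group's rewards."""
    rewards = np.asarray(rewards, dtype=float)
    mean, std = rewards.mean(), rewards.std()
    if std == 0.0:
        # All completions got the same reward: no learning signal.
        return np.zeros_like(rewards)
    return (rewards - mean) / std

# Example: a group of 4 sampled answers scored by an ORM (1 = correct).
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers get positive advantage, incorrect ones negative.
```

In a token-level policy-gradient loss, this per-completion advantage is broadcast to every token of the completion, which is what allows the paper to reinterpret the objective as an implicit, Monte-Carlo-based process reward.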

Michael Sullivan, Alexander Koller • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | MATH500 (test) | -- | 514 |
| Mathematical Reasoning | OlympiadBench | -- | 72 |
| Mathematical Reasoning | AMC23 (val) | Accuracy 80 | 24 |
| Mathematical Reasoning | AIME 2024 (val) | Pass@1 Success Rate 40 | 18 |
| Mathematical Reasoning | Minerva | Exact Match Accuracy 29.78 | 10 |
