GRPO is Secretly a Process Reward Model

About

Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs tuned with $\lambda$-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks, and reach peak performance more rapidly. These results show that we can leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance without employing an explicit PRM, and with a negligible impact on training time and cost.

Michael Sullivan, Alexander Koller • 2025
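
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of how standard GRPO turns outcome rewards into per-token advantages: each sampled completion in a group gets one scalar reward, the rewards are normalized within the group, and the resulting value is shared by every token of that completion. It does not implement the paper's PRM construction or $\lambda$-GRPO. The function names, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one group's outcome rewards to zero mean and unit scale.

    `eps` is an assumed stabilizer that avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Give every token of a completion that completion's single advantage."""
    return [[a] * n for a, n in zip(advantages, lengths)]


# Hypothetical group of four completions scored by a binary outcome reward.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 5, 2, 4]  # token counts, made up for the example

adv = group_relative_advantages(rewards)
token_adv = broadcast_to_tokens(adv, lengths)
```

In this sketch, all tokens within one completion share the same advantage. The abstract's claim is that, once completions in a group overlap in their prefixes, this seemingly uniform assignment behaves like a step-level (process) reward.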

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH500 (test)	--	514
Mathematical Reasoning	OlympiadBench	--	72
Mathematical Reasoning	AMC23 (val)	Accuracy: 80	24
Mathematical Reasoning	AIME 2024 (val)	Pass@1 Success Rate: 40	18
Mathematical Reasoning	Minerva	Exact Match Accuracy: 29.78	10
