REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

About

Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning Large Language Models (LLMs). The dominant algorithm, Proximal Policy Optimization (PPO), employs a critic network to estimate advantages, which introduces significant computational and memory overhead. To address this, a family of critic-free algorithms (e.g., GRPO, RLOO) has emerged. However, these methods typically rely on prompt-level (local) advantage normalization, which yields inaccurate advantage estimates, tends to overfit, and, as we show, is a theoretically biased estimator. To address these challenges, we introduce REINFORCE++, a critic-free framework centered on Global Advantage Normalization. By normalizing advantages across the entire global batch rather than small, prompt-specific groups, our method provides a more stable and theoretically sound estimate that is effectively unbiased (its bias vanishes as batch size increases). We introduce two variants: REINFORCE++, a highly efficient and general algorithm (k ≥ 1) for general-domain RLHF, and REINFORCE++ w/ baseline, a robust group-sampling variant (k > 1) for complex reasoning tasks. Our empirical evaluation demonstrates that each variant shows superior stability and performance in its respective domain, outperforming existing methods and even PPO in complex agentic settings.
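The contrast between prompt-level (local) and global advantage normalization can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the sketch assumes that k is the number of responses sampled per prompt, that rewards are scalar per response, and that the "w/ baseline" variant subtracts the per-prompt mean reward before standardizing over the whole batch. All function names are hypothetical.

```python
import numpy as np


def local_normalize(rewards, eps=1e-8):
    """Prompt-level (local) normalization, as in GRPO-style methods.

    `rewards` has shape (num_prompts, k); each row is standardized
    using only that prompt's own mean and standard deviation.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


def global_normalize(rewards, eps=1e-8):
    """Global normalization (assumed form): standardize over the whole batch.

    Works for any k >= 1 because no per-prompt statistics are needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def group_baseline_then_global_normalize(rewards, eps=1e-8):
    """Assumed form of the group-sampling variant (k > 1).

    Subtract the per-prompt mean reward as a baseline, then standardize
    the centered values with batch-wide statistics.
    """
    centered = rewards - rewards.mean(axis=1, keepdims=True)
    return (centered - centered.mean()) / (centered.std() + eps)


# Toy batch: 2 prompts, k = 4 sampled responses each, binary rewards.
rewards = np.array([[1.0, 0.0, 0.0, 0.0],
                    [1.0, 1.0, 1.0, 0.0]])

local_adv = local_normalize(rewards)
global_adv = global_normalize(rewards)
baseline_adv = group_baseline_then_global_normalize(rewards)
```

In the local version, the scale of each advantage depends on the standard deviation of a single small group, so a group with nearly identical rewards can produce very large normalized values. In the two global versions, a single batch-wide standard deviation sets the scale for every prompt.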

Jian Hu, Jason Klein Liu, Haotian Xu, Wei Shen • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 92.2	1398
Mathematical Reasoning	GSM8K (test)	Accuracy 57.3	954
Mathematical Reasoning	MATH	Accuracy 88.8	882
Mathematical Reasoning	MATH	Accuracy 88.8	535
Mathematical Reasoning	AIME 2024	Accuracy 35	479
Mathematical Reasoning	AIME 2024	Accuracy 23.9	370
Mathematical Reasoning	CollegeMATH	Accuracy 41.3	327
Mathematical Reasoning	AIME 2025	Accuracy 28.1	311
Question Answering	2Wiki	--	241
Question Answering	Bamboogle	--	227

Showing 10 of 120 rows

...
