RePO: Replay-Enhanced Policy Optimization

About

Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of $18.4$ and $4.1$ points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by $15\%$ while raising the number of effective optimization steps by $48\%$ for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to $8$. The repository can be accessed at https://github.com/SihengLi99/RePO.

Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu• 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH 500	Accuracy83.75	442
Mathematical Reasoning	AMC	Accuracy64.76	221
Mathematical Reasoning	AIME24	Accuracy30.42	160
Scientific Reasoning	ARC Challenge	--	115
General Reasoning	MMLU-Pro	pass@1 Accuracy42.5	93
Mathematical Reasoning	Olympiad	Accuracy45.44	90
General Reasoning	MMLU-Pro	MMLU-Pro General Reasoning Avg@8 Acc53.15	63
Mathematical Reasoning	Minerva	Pass@835.71	24
Scientific Reasoning	GPQA Diamond	pass@124.2	19
Mathematical Reasoning	Mathematics In-Distribution (AIME25, AMC23, Minerva, MATH500, Olymp.)	AIME25 Accuracy28.3	16

Showing 10 of 18 rows

Other info

Follow for update

@wizwand_team Discord