
RePO: Replay-Enhanced Policy Optimization

About

Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of $18.4$ and $4.1$ points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by $15\%$ while raising the number of effective optimization steps by $48\%$ for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to $8$. The repository can be accessed at https://github.com/SihengLi99/RePO.
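The core idea in the abstract — enlarging each prompt's GRPO group with off-policy samples retrieved from a replay buffer before computing group-relative advantages — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `ReplayBuffer` class, the `repo_group` helper, and the uniform-random replay strategy are all assumptions for clarity (the paper explores multiple replay strategies), and rewards are plain floats rather than model-scored rollouts.

```python
import random

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each reward by the group's
    mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

class ReplayBuffer:
    """Hypothetical per-prompt buffer holding past (completion, reward)
    pairs for reuse as off-policy samples."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.buffer = {}  # prompt -> list of (completion, reward)

    def add(self, prompt, completion, reward):
        samples = self.buffer.setdefault(prompt, [])
        samples.append((completion, reward))
        if len(samples) > self.capacity:
            samples.pop(0)  # drop the oldest sample

    def sample(self, prompt, k):
        """One possible replay strategy: uniform random retrieval."""
        samples = self.buffer.get(prompt, [])
        return random.sample(samples, min(k, len(samples)))

def repo_group(prompt, on_policy, buffer, num_off_policy=8):
    """Merge fresh on-policy rollouts with replayed off-policy samples,
    then compute advantages over the enlarged group."""
    replayed = buffer.sample(prompt, num_off_policy)
    group = list(on_policy) + replayed
    advantages = group_relative_advantages([r for _, r in group])
    # Store the new rollouts so future steps can replay them.
    for completion, reward in on_policy:
        buffer.add(prompt, completion, reward)
    return group, advantages
```

With both the on-policy and off-policy sample counts set to 8, as in the paper's analysis, each optimization step sees up to 16 samples per prompt while only 8 require fresh generation — which is consistent with the reported trade-off of modest extra compute for substantially more effective optimization steps.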

Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu · 2025

Related benchmarks

Task                   | Dataset       | Metric           | Result | Rank
-----------------------|---------------|------------------|--------|-----
Mathematical Reasoning | MATH 500      | Accuracy         | 83.75  | 442
Mathematical Reasoning | AMC           | Accuracy         | 64.76  | 221
Mathematical Reasoning | AIME24        | Accuracy         | 30.42  | 160
Scientific Reasoning   | ARC Challenge | --               | --     | 94
Mathematical Reasoning | Olympiad      | Accuracy         | 45.44  | 90
General Reasoning      | MMLU-Pro      | pass@1 Accuracy  | 42.5   | 69
General Reasoning      | MMLU-Pro      | Avg@8 Accuracy   | 53.15  | 63
Mathematical Reasoning | Minerva       | Pass@8           | 35.71  | 24
Scientific Reasoning   | GPQA Diamond  | pass@1           | 24.2   | 19
Mathematical Reasoning | MATH500       | Accuracy (k=8)   | 86.2   | 15

Showing 10 of 16 rows.
