RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

About

Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches the baseline's peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
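The two phases described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Trajectory`, `collect_verified`, and `blend_batch` are hypothetical, the reward is assumed to be a binary verifier signal, and the number of replayed samples per prompt (`n_replay`) is an assumed parameter that the abstract does not specify. The policy-gradient update itself is omitted.

```python
import random
from dataclasses import dataclass, replace


@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float           # assumed: 1.0 if a verifier accepted the answer, else 0.0
    replayed: bool = False  # True when the sample comes from the replay pool


def collect_verified(rollouts):
    """Phase 1 (hypothetical helper): keep only verified trajectories, grouped by prompt."""
    pool = {}
    for traj in rollouts:
        if traj.reward > 0:
            pool.setdefault(traj.prompt, []).append(traj)
    return pool


def blend_batch(fresh, pool, n_replay, rng):
    """Phase 2 (hypothetical helper): mix fresh rollouts for one prompt with replayed successes."""
    if not fresh:
        return []
    successes = pool.get(fresh[0].prompt, [])
    k = min(n_replay, len(successes))  # never request more than the pool holds
    replayed = [replace(t, replayed=True) for t in rng.sample(successes, k)]
    return list(fresh) + replayed
```

In a training loop under these assumptions, `blend_batch` would be called once per prompt at each update step, and the combined group would then be passed to whatever RL objective is in use.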

Hongzhi Zhang, Jia Fu, Jingyuan Zhang, Kai Fu, Qi Wang, Fuzheng Zhang, Guorui Zhou • 2025

Related benchmarks

Task	Dataset	Result
Multimodal Reasoning	WeMath	Accuracy 62.48	199
Multimodal Reasoning	MMBench	Accuracy 90.45	180
Multimodal Reasoning	MathVision	Accuracy 54.23	162
Multimodal Reasoning	MathVerse	Accuracy 58.91	138
Multimodal Reasoning	MMStar	Accuracy 72.27	102
Mathematical Reasoning	BRUMO25	Accuracy 28.9	89
Mathematical Reasoning	HMMT 25	Accuracy (HMMT 25) 10.8	50
Mathematical Reasoning	AIME 25	Accuracy 23	48
Mathematical Reasoning	Beyond AIME	Accuracy 10.6	45
Image-Text Understanding	MMBench	Accuracy 90.45	40

Showing 10 of 21 rows
