
RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

About

Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
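The two phases described above can be illustrated with a minimal sketch. It is not the authors' implementation (that lives in the linked repository). The names `Trajectory`, `ReplayPool`, `build_group`, and `num_replay` are hypothetical, and grouping replayed successes per prompt is an assumption, since the abstract only says that mini-batches blend new rollouts with replayed successes.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    prompt: str
    response: str
    verified: bool  # True if the final answer passed a correctness check


class ReplayPool:
    """Phase 1 (assumed form): store only verified trajectories, keyed by prompt."""

    def __init__(self):
        self._by_prompt = {}

    def add(self, traj: Trajectory) -> None:
        # Unverified trajectories are dropped; only successes are replayed later.
        if traj.verified:
            self._by_prompt.setdefault(traj.prompt, []).append(traj)

    def sample(self, prompt: str, k: int, rng: random.Random) -> list:
        # Return up to k stored successes for this prompt (fewer if the pool is small).
        stored = self._by_prompt.get(prompt, [])
        return rng.sample(stored, min(k, len(stored)))


def build_group(prompt, fresh_rollouts, pool, num_replay, rng):
    """Phase 2 (assumed form): blend new rollouts with replayed successes."""
    replayed = pool.sample(prompt, num_replay, rng)
    return list(fresh_rollouts) + replayed
```

In this sketch, a prompt with no stored successes simply trains on its fresh rollouts, so the mix degrades gracefully when the pool is empty. How the blended group enters the policy update (advantage computation, loss weighting) is left out, because the abstract does not specify it.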

Hongzhi Zhang, Jia Fu, Jingyuan Zhang, Kai Fu, Qi Wang, Fuzheng Zhang, Guorui Zhou • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multimodal Reasoning | WeMath | Accuracy | 62.48 | 171 |
| Multimodal Reasoning | MathVision | Accuracy | 54.23 | 162 |
| Multimodal Reasoning | MathVerse | Accuracy | 58.91 | 130 |
| Multimodal Reasoning | MMBench | Accuracy | 90.45 | 127 |
| Multimodal Reasoning | MMStar | Accuracy | 72.27 | 78 |
| Mathematical Reasoning | BRUMO25 | Accuracy | 28.9 | 62 |
| Mathematical Reasoning | HMMT 25 | Accuracy (HMMT 25) | 10.8 | 50 |
| Mathematical Reasoning | AIME 25 | Accuracy | 23 | 48 |
| Mathematical Reasoning | Beyond AIME | Accuracy | 10.6 | 45 |
| Multi-modal Reasoning | MMMU-Pro | Accuracy | 55.38 | 36 |
Showing 10 of 14 rows
