EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

About

Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, diminished exploratory capacity, and ultimately limited performance gains. Although techniques that increase policy stochasticity can promote exploration, they frequently fail to escape dominant behavioral modes. This creates a self-reinforcing loop -- repeatedly sampling and rewarding dominant modes -- that further erodes exploration. We introduce Exploration-Enhanced Policy Optimization (EEPO), a framework that promotes exploration via two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the trajectories; it then undergoes a lightweight unlearning step to temporarily suppress these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism disrupts the self-reinforcing loop and promotes wider exploration during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO, achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.

Liang Chen, Xueting Han, Qizhou Wang, Bo Han, Jing Bai, Hinrich Schütze, Kam-Fai Wong • 2025
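
The sample-then-forget rollout described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation. It assumes a categorical policy over a handful of candidate responses instead of an LLM, and it models the unlearning step as a single gradient step that lowers the mean log-probability of the stage-one samples, applied to a temporary copy of the logits. The names `unlearn`, `two_stage_rollout`, and `lr` are hypothetical, and the adaptive part of the unlearning is not modeled.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unlearn(logits, sampled, lr=1.0):
    """Return a temporary copy of the logits with the sampled responses suppressed.

    Assumption: one gradient-descent step on the mean log-probability of the
    sampled responses. The input list is not modified.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y in sampled:
        for k in range(len(logits)):
            # d log p(y) / d logit_k = 1[k == y] - p_k
            grad[k] += ((1.0 if k == y else 0.0) - probs[k]) / len(sampled)
    return [z - lr * g for z, g in zip(logits, grad)]


def two_stage_rollout(logits, n, lr=1.0, rng=None):
    """Sample n responses in two stages with a temporary unlearning step between them."""
    rng = rng or random.Random(0)
    responses = list(range(len(logits)))
    half = n // 2
    # Stage 1: sample half of the rollouts from the current policy.
    stage1 = rng.choices(responses, weights=softmax(logits), k=half)
    # Forget: suppress the stage-one samples in a temporary copy of the policy.
    temp_logits = unlearn(logits, stage1, lr)
    # Stage 2: sample the remaining rollouts from the suppressed policy.
    stage2 = rng.choices(responses, weights=softmax(temp_logits), k=n - half)
    # The original logits are untouched, so the suppression is only temporary.
    return stage1, stage2, temp_logits


if __name__ == "__main__":
    base = [3.0, 0.0, 0.0, 0.0]  # response 0 is the dominant mode
    s1, s2, temp = two_stage_rollout(base, n=8, lr=5.0)
    print("stage 1:", s1)
    print("stage 2:", s2)
    print("p(dominant) before:", softmax(base)[0], "after:", softmax(temp)[0])
```

In this toy setting, when stage one samples mostly the dominant response, the temporary policy assigns it less probability, so stage two is more likely to reach other responses. Both stages' samples would then be scored together by the usual reward, and the temporary logits are discarded.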

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	AIME 24	Accuracy 6.7	358
Mathematical Reasoning	Olympiad Bench	Accuracy 29.3	254
Mathematical Reasoning	Minerva Math	Accuracy 39.3	251
Mathematical Reasoning	Minerva Math	Accuracy 20.6	228
Mathematical Reasoning	OlympiadBench	Accuracy 50.1	213
Mathematical Reasoning	AIME 25	Pass@1 Accuracy 30	190
Mathematical Reasoning	Minerva Math	Accuracy 41.5	124
Mathematical Reasoning	AMC 23	Accuracy 35	113
Mathematical Reasoning	AMC 23	Pass@1 Accuracy 62.5	109
Mathematical Reasoning	Reasoning Benchmarks Average	Average Accuracy 44.7	12

Showing 10 of 13 rows
