
Arbitrary Entropy Policy Optimization Breaks The Exploration Bottleneck of Reinforcement Learning

About

Reinforcement Learning (RL) is essential for enhancing the reasoning capabilities of large language models (LLMs), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, causing exploration to vanish and policies to converge prematurely. As a result, RL is widely believed to be incapable of expanding the reasoning frontier of LLMs. Existing entropy-regularized methods introduce an inevitable trade-off between reward and entropy, so exploration comes at the cost of non-negligible optimization bias. In this work, we prove that temperature-guided REINFORCE can modulate policy entropy, and we propose Arbitrary Entropy Policy Optimization (AEPO), which reformulates entropy regularization as a policy-gradient optimization problem. Rather than manipulating entropy directly, AEPO regulates it implicitly by applying a REINFORCE regularization term to temperature-adjusted samples, ensuring that entropy is controlled but never dominates optimization and thereby enabling arbitrary, principled entropy regulation. Experiments show that AEPO outperforms RL baselines on both pass@1 and pass@$k$, and even surpasses the base model on pass@1024. By modulating entropy precisely, AEPO achieves more effective optimization dynamics and provides direct empirical evidence that entropy, exploration, and performance are intrinsically linked.
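The abstract describes AEPO only at a high level, so the following is a minimal sketch of one plausible instantiation, not the paper's actual objective: a GRPO-style policy-gradient term plus a REINFORCE regularization term computed on samples drawn at an adjusted temperature. The function name `aepo_loss` and the hyperparameters `beta` and `tau` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def aepo_loss(logits, actions, advantages, beta=0.01, tau=1.5):
    """Sketch of an AEPO-style objective (hypothetical form).

    logits:      [B, V] per-step policy logits
    actions:     [B]    sampled token ids (on-policy rollout)
    advantages:  [B]    group-relative advantages, as in GRPO
    beta:        weight of the REINFORCE regularization term
    tau:         sampling temperature for the regularizer
    """
    log_probs = F.log_softmax(logits, dim=-1)

    # Standard policy-gradient (GRPO-style) term on on-policy samples.
    pg_loss = -(advantages
                * log_probs.gather(-1, actions[:, None]).squeeze(-1)).mean()

    # REINFORCE regularizer: draw samples from the temperature-adjusted
    # distribution and increase their log-likelihood under the policy.
    with torch.no_grad():
        tempered = torch.distributions.Categorical(logits=logits / tau)
        reg_actions = tempered.sample()
    reg_loss = -log_probs.gather(-1, reg_actions[:, None]).squeeze(-1).mean()

    return pg_loss + beta * reg_loss
```

Under these assumptions, `tau > 1` makes the regularizer reinforce samples from a higher-entropy distribution, implicitly raising policy entropy, while `tau < 1` does the opposite; this is consistent with the abstract's claim that temperature-guided REINFORCE can modulate entropy in either direction while the reward term remains dominant.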

Chen Wang, Zhaochun Li, Jionghao Bai, Yuzhi Zhang, Shisheng Cui, Zhou Zhao, Yue Wang • 2025

Related benchmarks

Task                     Dataset     Result            Rank
Mathematical Reasoning   AIME 2024   Accuracy: 54.5    251
Mathematical Reasoning   HMMT 2025   --                38
Mathematical Reasoning   AIME 2025   Pass@128: 76.7    6
Mathematical Reasoning   AIME 2024   Pass@128: 83.3    6
