SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training

About

Reinforcement learning (RL) is a key paradigm for post-training large language models (LLMs), but the widely used Group Relative Policy Optimization (GRPO) often suffers from entropy collapse: exploration quickly disappears, policies converge prematurely, and sample diversity declines, ultimately harming training effectiveness. Existing remedies, including entropy bonuses and clip-based methods, rarely keep entropy within a stable exploration regime and often introduce oscillatory entropy or reward degradation. In this work, we identify a previously overlooked asymmetry in entropy dynamics: under high-temperature sampling, positive and negative samples have opposite effects on policy entropy. Specifically, high-temperature positive samples promote entropy growth, whereas negative samples suppress it. We provide a theoretical explanation for this phenomenon: when entropy decreases during policy updates, its derivative with respect to temperature is strictly positive under positive-sample updates, indicating that high-temperature positive samples can counteract entropy decay, thereby slowing entropy collapse and potentially reversing it. Motivated by this insight, we propose SCOPE-RL, a stable and quantitative entropy control framework through a regularization term constructed from temperature-adaptive positive samples. Extensive experiments show that SCOPE-RL consistently outperforms strong RL baselines on both Pass@1 and Pass@$k$. Our results provide evidence that escaping entropy collapse can improve reasoning performance, while also showing that the benefit is non-monotonic, with an optimal level of exploration for RL post-training in reasoning LLMs.
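As a small illustration of the two quantities the abstract reasons about, policy entropy and sampling temperature, the sketch below computes the Shannon entropy of a temperature-scaled softmax over a handful of logits. The logits and helper names are hypothetical and chosen only for illustration. This is not SCOPE-RL's regularizer or its update rule; it only shows the basic fact that sampling at a higher temperature draws from a higher-entropy distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical next-token logits, not taken from the paper.
logits = [2.0, 1.0, 0.5, -1.0]

# Entropy of the sampling distribution at three temperatures.
entropies = {t: entropy(softmax(logits, t)) for t in (0.5, 1.0, 2.0)}

for t, h in entropies.items():
    print(f"T={t}: H={h:.3f} nats")
```

For any non-constant logits, the entropy rises with temperature and stays below the uniform bound of log(vocabulary size), which is why temperature is a natural knob for exploration.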

Chen Wang, Zhaochun Li, Jionghao Bai, Hexuan Deng, Ge Lan, Yue Wang • 2025

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | AIME 2024 | Accuracy 54.5 | 370
Mathematical Reasoning | AIME 2024 | Accuracy 56.5 | 220
Mathematical Reasoning | AIME 2025 | Accuracy 43.7 | 214
Mathematical Reasoning | HMMT 2025 | -- | 194
Knowledge Reasoning | MMLU-Pro | Accuracy 76.1 | 120
Mathematical Reasoning | Minerva Math | Accuracy 48.9 | 104
Mathematical Reasoning | Olympiad Math | Accuracy 61.5 | 35
Reasoning | ARC Challenge | Accuracy (ARC) 0.943 | 34
Knowledge-intensive reasoning | SuperGPQA | Overall Score 38.6 | 31
Mathematical Reasoning | AIME 2025 | Pass@128 76.7 | 6
Showing 10 of 13 rows
