SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training
About
Reinforcement learning (RL) is a key paradigm for post-training large language models (LLMs), but the widely used Group Relative Policy Optimization (GRPO) often suffers from entropy collapse: exploration quickly disappears, policies converge prematurely, and sample diversity declines, which ultimately harms training effectiveness. Existing remedies, including entropy bonuses and clip-based methods, rarely keep entropy within a stable exploration regime and often introduce entropy oscillations or reward degradation. In this work, we identify a previously overlooked asymmetry in entropy dynamics: under high-temperature sampling, positive and negative samples have opposite effects on policy entropy. Specifically, high-temperature positive samples promote entropy growth, whereas negative samples suppress it. We provide a theoretical explanation for this phenomenon: when entropy decreases during policy updates, its derivative with respect to the sampling temperature is strictly positive under positive-sample updates. High-temperature positive samples can therefore counteract entropy decay, slowing entropy collapse and potentially reversing it. Motivated by this insight, we propose SCOPE-RL, a framework for stable and quantitative entropy control that uses a regularization term constructed from temperature-adaptive positive samples. Extensive experiments show that SCOPE-RL consistently outperforms strong RL baselines on both Pass@1 and Pass@$k$. Our results provide evidence that escaping entropy collapse can improve reasoning performance, while also showing that the benefit is non-monotonic: there is an optimal level of exploration for RL post-training in reasoning LLMs.
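The quantities the abstract refers to can be illustrated with a small sketch. It is not the SCOPE-RL regularizer, which the abstract does not specify. It only shows (a) the Shannon entropy of a temperature-scaled softmax, which grows with temperature for non-uniform logits, and (b) a group-relative advantage in the style commonly described for GRPO, used here to split a sampled group into positive and negative samples. The function names, the toy logits and rewards, and the convention "positive means above the group mean" are assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def group_relative_advantages(rewards, eps=1e-8):
    # Assumed GRPO-style normalization: (reward - group mean) / group std.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical next-token logits: entropy rises as temperature rises.
logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.6, 1.0, 1.4):
    print(f"T={t}: entropy={entropy(softmax(logits, t)):.4f}")

# Hypothetical binary rewards for a group of four sampled responses.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
positive = [i for i, a in enumerate(adv) if a > 0]  # assumed "positive samples"
negative = [i for i, a in enumerate(adv) if a < 0]  # assumed "negative samples"
print("positive:", positive, "negative:", negative)
```

In this toy setting the positive/negative split depends only on the sign of the advantage. How SCOPE-RL adapts the temperature for positive samples and weights the resulting term is left to the paper itself.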
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | Accuracy | 54.5 | 370 |
| Mathematical Reasoning | AIME 2024 | Accuracy | 56.5 | 220 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 43.7 | 214 |
| Mathematical Reasoning | HMMT 2025 | -- | -- | 194 |
| Knowledge Reasoning | MMLU-Pro | Accuracy | 76.1 | 120 |
| Mathematical Reasoning | Minerva Math | Accuracy | 48.9 | 104 |
| Mathematical Reasoning | Olympiad Math | Accuracy | 61.5 | 35 |
| Reasoning | ARC Challenge | Accuracy (ARC) | 0.943 | 34 |
| Knowledge-intensive reasoning | SuperGPQA | Overall Score | 38.6 | 31 |
| Mathematical Reasoning | AIME 2025 | Pass@128 | 76.7 | 6 |