Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization
About
To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into a normal mode and a high-entropy mode, the latter triggered by a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives: the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes on general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, in which the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.
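The abstract does not give the objective in closed form, but the described structure — a shared-parameter policy trained with a correctness term for the normal mode and an additional entropy bonus for the high-entropy mode — can be sketched as a combined loss. The function names, the policy-gradient surrogate, and the weight `beta` below are illustrative assumptions, not the paper's actual formulation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def dual_mode_loss(logp_normal, reward_normal,
                   logp_high, reward_high, probs_high, beta=0.1):
    """Hypothetical combined objective for the two modes.

    The normal mode is trained purely on task reward, while the
    high-entropy mode adds an entropy bonus (weighted by `beta`) to
    encourage exploration. Because both terms are computed from the
    same shared-parameter policy, gradients from both modes update
    one model, which is the collaborative learning described above.
    """
    # REINFORCE-style surrogate for the normal (correctness) mode
    loss_normal = -reward_normal * logp_normal
    # High-entropy mode: same surrogate plus an entropy bonus
    loss_high = -(reward_high * logp_high + beta * entropy(probs_high))
    return loss_normal + loss_high
```

With `beta = 0` the high-entropy term collapses to the same surrogate as the normal mode; increasing `beta` trades some reward-following for broader exploration in that mode only.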
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Knowledge Reasoning | MMLU-Pro | Accuracy | 75.41 | 27 |
| Mathematical Reasoning | AIME 24/25 | Accuracy | 71.88 | 27 |
| Science Reasoning | GPQA Diamond | Accuracy | 58.9 | 27 |
| Mathematical Reasoning | MATH 500 | Accuracy | 94.4 | 27 |
| Creative Writing | Creative writing (test) | Creativity | 89.23 | 20 |