
Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

About

To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into a normal mode and a high-entropy mode, distinguished by a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to their distinct objectives: the normal mode optimizes for task correctness, while the high-entropy mode additionally incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes on general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, in which the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.
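The core idea described above, two prompt-induced modes sharing one set of weights, where only the high-entropy mode receives an entropy bonus, can be sketched as a pair of policy-gradient losses. This is a minimal illustrative sketch, not the paper's implementation; the function names, the entropy coefficient, and the example distributions are all assumptions for illustration.

```python
import math

def entropy(probs):
    # Shannon entropy of a categorical distribution
    return -sum(p * math.log(p) for p in probs if p > 0)

def mode_loss(logp_action, advantage, probs, entropy_coef=0.0):
    # Policy-gradient surrogate: -log pi(a) * A, minus an optional
    # entropy bonus (subtracting entropy from the loss rewards it).
    return -logp_action * advantage - entropy_coef * entropy(probs)

# Two action distributions from the same (hypothetical) shared model,
# conditioned on the normal prompt vs. the high-entropy prompt.
normal_probs = [0.7, 0.2, 0.1]    # sharper: normal mode
high_probs = [0.4, 0.35, 0.25]    # flatter: high-entropy mode

# Normal mode: task correctness only (no entropy term).
loss_normal = mode_loss(math.log(normal_probs[0]), 1.0, normal_probs,
                        entropy_coef=0.0)
# High-entropy mode: same task term plus an exploration preference.
loss_high = mode_loss(math.log(high_probs[0]), 1.0, high_probs,
                      entropy_coef=0.1)

# Both modes are trained jointly on the shared parameters.
total_loss = loss_normal + loss_high
```

Under this reading, the shared weights receive gradients from both modes, so exploratory behavior discovered by the high-entropy mode can inform the normal mode without diluting its correctness objective.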

Jiashu Yao, Heyan Huang, Chuwei Luo, Daiqing Wu, Zeming Liu, Yuhang Guo, Yangyang Kang• 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Knowledge Reasoning | MMLU-Pro | Accuracy | 75.41 | 27 |
| Mathematical Reasoning | AIME 24/25 | Accuracy | 71.88 | 27 |
| Science Reasoning | GPQA Diamond | Accuracy | 58.9 | 27 |
| Mathematical Reasoning | MATH 500 | Accuracy | 94.4 | 27 |
| Creative Writing | Creative writing (test) | Creativity | 89.23 | 20 |
