Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization
About
To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into a normal mode and a high-entropy mode, the latter triggered by a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives: the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes on general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, in which the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.
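The abstract does not give the objective in closed form, but the described structure — a shared-parameter policy trained with a correctness term for the normal mode and an additional entropy bonus for the high-entropy mode — can be sketched as a combined loss. The function names, the policy-gradient surrogate, and the weight `beta` below are illustrative assumptions, not the paper's actual formulation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def dual_mode_loss(logp_normal, reward_normal,
                   logp_high, reward_high, probs_high, beta=0.1):
    """Hypothetical combined objective for the two modes.

    The normal mode is trained purely on task reward, while the
    high-entropy mode adds an entropy bonus (weighted by `beta`) to
    encourage exploration. Because both terms are computed from the
    same shared-parameter policy, gradients from both modes update
    one model, which is the collaborative learning described above.
    """
    # REINFORCE-style surrogate for the normal (correctness) mode
    loss_normal = -reward_normal * logp_normal
    # High-entropy mode: same surrogate plus an entropy bonus
    loss_high = -(reward_high * logp_high + beta * entropy(probs_high))
    return loss_normal + loss_high
```

With `beta = 0` the high-entropy term collapses to the same surrogate as the normal mode; increasing `beta` trades some reward-following for broader exploration in that mode only.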
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Knowledge Reasoning | MMLU-Pro | Accuracy | 75.41 | 27 |
| Mathematical Reasoning | AIME 24/25 | Accuracy | 71.88 | 27 |
| Science Reasoning | GPQA Diamond | Accuracy | 58.9 | 27 |
| Mathematical Reasoning | MATH 500 | Accuracy | 94.4 | 27 |
| Creative Writing | Creative writing (test) | Creativity | 89.23 | 20 |