DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
About
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity-space disentangling strategy that divides the sample space into distinct exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby mining fine-grained samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that implements perplexity-guided exploration and exploitation with minimal impact on the verification rewards, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling; experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance through a fine-grained exploration-exploitation trade-off.
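The abstract describes two ideas: splitting rollouts by perplexity into exploration and exploitation subspaces, and adding a small bidirectional adjustment on top of the verifiable reward. The sketch below illustrates one plausible reading of that pipeline; the threshold `ppl_threshold`, the bonus scale `epsilon`, the sign convention, and the dictionary field names are all illustrative assumptions, not the paper's exact scheme.

```python
import math

def perplexity(token_logprobs):
    # Sequence perplexity: exp of the mean negative token log-likelihood
    # under the current policy.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def shape_rewards(rollouts, ppl_threshold=2.0, epsilon=0.05):
    """Perplexity-guided bidirectional reward allocation (illustrative only).

    High-perplexity rollouts fall in the exploration subspace and receive a
    small positive adjustment; low-perplexity rollouts fall in the
    exploitation subspace and receive a small negative one. epsilon is kept
    small so the verifiable reward stays dominant.
    """
    shaped = []
    for r in rollouts:
        ppl = perplexity(r["token_logprobs"])
        direction = 1.0 if ppl >= ppl_threshold else -1.0  # explore vs. exploit
        shaped.append(r["verifiable_reward"] + epsilon * direction)
    return shaped

# Toy usage: one confident (low-PPL) correct rollout, one uncertain
# (high-PPL) incorrect rollout.
rollouts = [
    {"token_logprobs": [-0.2, -0.1, -0.3], "verifiable_reward": 1.0},
    {"token_logprobs": [-1.5, -2.0, -1.1], "verifiable_reward": 0.0},
]
print(shape_rewards(rollouts))  # [0.95, 0.05]
```

The key design point this sketch tries to capture is that the perplexity signal only perturbs, and never overrides, the verification reward, which is what allows the policy update to stay stable.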
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Function Calling | BFCL V3 | Overall Accuracy | 62.51 | 104 |
| Mathematical Reasoning | OLY | -- | -- | 91 |
| Mathematical Reasoning | AMC | Acc@8 | 79.52 | 27 |
| Mathematical Reasoning | AIME24 | Accuracy | 35 | 18 |
| Mathematical Reasoning | AIME25 | Accuracy | 27.5 | 18 |
| Mathematical Reasoning | AMC23 | Accuracy | 71.23 | 18 |
| Mathematical Reasoning | OlympiadBench OE_TO_mat_en_COMP | Accuracy | 57.73 | 18 |
| Mathematical Reasoning | AIME24 | Acc (maj@8) | 43.3 | 18 |
| Mathematical Reasoning | MATH500 | Accuracy | 89.55 | 18 |
| Mathematical Reasoning | MATH | Acc (maj@8) | 92.6 | 18 |