
DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

About

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby identifying the fine-grained samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that has minimal impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling. Experimental results demonstrate the superiority of the proposed method and confirm that a fine-grained exploration-exploitation trade-off improves LLM performance.
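
The perplexity-based partition described in the abstract can be illustrated with a minimal sketch. It covers only the split into high-perplexity (exploration) and low-perplexity (exploitation) subspaces. The function names, the use of per-token log-probabilities as input, and the median as the default threshold are all assumptions made for illustration; the abstract does not state how the threshold is chosen or how rewards are then allocated.

```python
import math
from statistics import median


def perplexity(token_logprobs):
    """Perplexity of one sampled response: exp of the negative mean
    per-token natural-log probability."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(samples, threshold=None):
    """Partition responses into exploration (high perplexity) and
    exploitation (low perplexity) index lists.

    `samples` is a list of per-token log-probability lists.
    ASSUMPTION: when no threshold is given, the batch median is used.
    This rule is hypothetical and is not taken from the paper.
    """
    ppl = [perplexity(lp) for lp in samples]
    if threshold is None:
        threshold = median(ppl)
    explore = [i for i, p in enumerate(ppl) if p > threshold]
    exploit = [i for i, p in enumerate(ppl) if p <= threshold]
    return explore, exploit, ppl


if __name__ == "__main__":
    batch = [
        [math.log(0.9), math.log(0.8)],  # confident response
        [math.log(0.2), math.log(0.1)],  # uncertain response
        [math.log(0.5), math.log(0.5)],  # in between
    ]
    explore_idx, exploit_idx, ppl_values = split_by_perplexity(batch)
    print("exploration subspace:", explore_idx)
    print("exploitation subspace:", exploit_idx)
```

In a training loop, the two index lists would presumably feed whatever reward adjustment the method applies to each subspace; that step is left out here because the abstract gives no formula for it.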

Xiaofan Li, Ming Yang, Zhiyuan Ma, Shichao Ma, Jintao Du, Yu Cheng, Weiqiang Wang, Zhizhong Zhang, Xin Tan, Yanyun Qu, Lizhuang Ma, Yuan Xie • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Function Calling | BFCL V3 | Overall Accuracy | 62.51 | 104
Mathematical Reasoning | OLY | -- | -- | 91
Mathematical Reasoning | AMC | Acc@8 | 79.52 | 27
Mathematical Reasoning | AIME24 | Accuracy | 35 | 18
Mathematical Reasoning | AIME25 | Accuracy (ACC) | 27.5 | 18
Mathematical Reasoning | AMC23 | Accuracy (ACC) | 71.23 | 18
Mathematical Reasoning | OlympiadBench OE_TO_mat_en_COMP | Accuracy | 57.73 | 18
Mathematical Reasoning | AIME24 | ACC/maj@8 | 43.3 | 18
Mathematical Reasoning | MATH500 | Accuracy (MATH500) | 89.55 | 18
Mathematical Reasoning | MATH | ACC (maj@8) | 92.6 | 18
Showing 10 of 16 rows
