DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
About
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity-space disentangling strategy that divides the sample space into distinct exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby mining fine-grained samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that implements perplexity-guided exploration and exploitation with minimal impact on the verification rewards, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling; experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance through a fine-grained exploration-exploitation trade-off.
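The abstract describes two ideas: splitting rollouts by perplexity into exploration and exploitation subspaces, and adding a small bidirectional adjustment on top of the verifiable reward. The sketch below illustrates one plausible reading of that pipeline; the threshold `ppl_threshold`, the bonus scale `epsilon`, the sign convention, and the dictionary field names are all illustrative assumptions, not the paper's exact scheme.

```python
import math

def perplexity(token_logprobs):
    # Sequence perplexity: exp of the mean negative token log-likelihood
    # under the current policy.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def shape_rewards(rollouts, ppl_threshold=2.0, epsilon=0.05):
    """Perplexity-guided bidirectional reward allocation (illustrative only).

    High-perplexity rollouts fall in the exploration subspace and receive a
    small positive adjustment; low-perplexity rollouts fall in the
    exploitation subspace and receive a small negative one. epsilon is kept
    small so the verifiable reward stays dominant.
    """
    shaped = []
    for r in rollouts:
        ppl = perplexity(r["token_logprobs"])
        direction = 1.0 if ppl >= ppl_threshold else -1.0  # explore vs. exploit
        shaped.append(r["verifiable_reward"] + epsilon * direction)
    return shaped

# Toy usage: one confident (low-PPL) correct rollout, one uncertain
# (high-PPL) incorrect rollout.
rollouts = [
    {"token_logprobs": [-0.2, -0.1, -0.3], "verifiable_reward": 1.0},
    {"token_logprobs": [-1.5, -2.0, -1.1], "verifiable_reward": 0.0},
]
print(shape_rewards(rollouts))  # [0.95, 0.05]
```

The key design point this sketch tries to capture is that the perplexity signal only perturbs, and never overrides, the verification reward, which is what allows the policy update to stay stable.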
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Function Calling | BFCL V3 | Overall Accuracy | 62.51 | 104 |
| Mathematical Reasoning | OLY | -- | -- | 91 |
| Mathematical Reasoning | AMC | Acc@8 | 79.52 | 27 |
| Mathematical Reasoning | AIME24 | Accuracy | 35 | 18 |
| Mathematical Reasoning | AIME25 | Accuracy | 27.5 | 18 |
| Mathematical Reasoning | AMC23 | Accuracy | 71.23 | 18 |
| Mathematical Reasoning | OlympiadBench OE_TO_mat_en_COMP | Accuracy | 57.73 | 18 |
| Mathematical Reasoning | AIME24 | Acc (maj@8) | 43.3 | 18 |
| Mathematical Reasoning | MATH500 | Accuracy | 89.55 | 18 |
| Mathematical Reasoning | MATH | Acc (maj@8) | 92.6 | 18 |