CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

About

Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose Coordinating Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
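
The clipping behavior described in the abstract can be illustrated with a small sketch. It computes, for a single token, the coefficient that multiplies the gradient of the token's log-probability under a PPO-style clipped surrogate. With both scale factors set to zero it reproduces standard PPO, where tokens outside the clipping interval contribute no gradient. With nonzero scale factors, those tokens contribute a gradient whose magnitude is bounded by the clipping boundary. The function name, the parameters `beta_low` and `beta_high`, and the exact form of the reintroduced gradient are assumptions made for illustration; they are not taken from the text above and should not be read as the authors' implementation.

```python
def token_grad_weight(ratio, advantage, eps_low=0.2, eps_high=0.2,
                      beta_low=0.0, beta_high=0.0):
    """Coefficient on grad(log pi(token)) for one token (illustrative sketch).

    ratio      -- importance ratio pi_new(token) / pi_old(token)
    advantage  -- advantage estimate for the token
    eps_*      -- clipping interval is [1 - eps_low, 1 + eps_high]
    beta_*     -- hypothetical scale factors for the reintroduced gradient;
                  0.0 recovers the standard PPO clipped surrogate
    """
    low, high = 1.0 - eps_low, 1.0 + eps_high

    if advantage > 0 and ratio > high:
        # Standard PPO clips here and the gradient vanishes.
        # Assumed variant: keep a signal bounded by the upper boundary.
        return beta_high * high * advantage

    if advantage < 0 and ratio < low:
        # Standard PPO clips here as well.
        # Assumed variant: keep a signal bounded by the lower boundary.
        return beta_low * low * advantage

    # Inside the interval (or clipping inactive): the gradient of
    # ratio * advantage with respect to log pi is ratio * advantage.
    return ratio * advantage


if __name__ == "__main__":
    # A positive-advantage token whose ratio exceeds the upper boundary.
    print("standard PPO:", token_grad_weight(1.5, 1.0))
    print("bounded gradient kept:", token_grad_weight(1.5, 1.0, beta_high=0.5))
```

In this sketch the reintroduced gradient depends on the clipping boundary rather than on the ratio itself, so its magnitude stays bounded however far the ratio drifts outside the interval.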

Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou • 2025

Related benchmarks

Task	Dataset	Metric	Value	
Mathematical Reasoning	MATH 500	Accuracy (Acc)	91	600
Mathematical Reasoning	MATH 500	Accuracy	76.7	589
Mathematical Reasoning	AIME 2024	Accuracy	42	525
Mathematical Reasoning	AIME 2024	Accuracy	35.1	394
Mathematical Reasoning	AIME 2025	Accuracy	27.7	378
Mathematical Reasoning	HMMT 2025	Accuracy	67.3	241
Mathematical Reasoning	AMC 2023	Accuracy	85.9	144
Mathematical Reasoning	OlympiadBench	Accuracy	45.6	134
Math Reasoning	MATH500	Accuracy	95.6	127
Math Reasoning	AMC23	Pass@1 Accuracy	93.8	99

Showing 10 of 22 rows
