
AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

About

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.
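To make the second controller concrete, here is a minimal sketch of bidirectional temperature adaptation around a base temperature. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the formulas, so the `tanh` update rule, the exponential-moving-average baseline, the use of probe vote entropy as the only uncertainty signal, the choice to heat when uncertainty is above the baseline, and all names and default values (`TemperatureController`, `gain`, `momentum`, `t_min`, `t_max`) are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


def vote_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical answer distribution in a group."""
    counts = Counter(answers)
    n = len(answers)
    return sum((c / n) * math.log(n / c) for c in counts.values())


@dataclass
class TemperatureController:
    """Hypothetical controller: heats or cools decoding around base_temp."""
    base_temp: float = 1.0   # assumed base decoding temperature
    gain: float = 0.5        # assumed maximum deviation from base_temp
    momentum: float = 0.9    # assumed EMA factor for the running baseline
    t_min: float = 0.6       # assumed lower bound on temperature
    t_max: float = 1.4       # assumed upper bound on temperature
    baseline: Optional[float] = None

    def step(self, uncertainty: float) -> float:
        # Initialise the running baseline on the first observation.
        if self.baseline is None:
            self.baseline = uncertainty
        # Centered uncertainty: positive means more uncertain than usual.
        centered = uncertainty - self.baseline
        # Assumption: heat when above the baseline, cool when below it.
        temp = self.base_temp + self.gain * math.tanh(centered)
        temp = min(max(temp, self.t_min), self.t_max)
        # Update the baseline after using it, so the current step
        # is compared against past statistics only.
        self.baseline = (
            self.momentum * self.baseline + (1.0 - self.momentum) * uncertainty
        )
        return temp
```

In use, each training step would compute `vote_entropy` over the final answers of a probe group and pass it to `step`, then sample the rollout group at the returned temperature. The same centered statistic could also feed the clipping controller, although that mapping is not sketched here because the abstract does not specify it.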

Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao • 2026

Related benchmarks

Task                    Dataset       Metric           Result  Rank
Mathematical Reasoning  CMath         Accuracy         77.6    63
Code Generation         CodeContests  Accuracy         17.3    30
STEM Reasoning          MMLU STEM     Accuracy (STEM)  58.1    15
STEM Reasoning          OCW           Accuracy         17.3    7
STEM Reasoning          SAT           Accuracy         0.893   7
STEM Reasoning          GaokaoCloze   Accuracy         25.9    7
STEM Reasoning          GaokaoQA      Accuracy         41.1    7
