AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

About

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.
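As a rough illustration of the two controllers described above, the following minimal sketch shows how a shared group-level statistical state could drive both a clip range and a decoding temperature. It is not the authors' implementation: the function names, the update rules, and every coefficient (`a`, `b`, `c`, `gain`, `momentum`, and the clamping bounds) are hypothetical choices made for this example, and the sketch omits reward skewness and policy entropy, which the abstract also lists as inputs.

```python
import math
from dataclasses import dataclass
from typing import Optional


def vote_entropy(answers):
    """Normalized entropy in [0, 1] of the final answers in one sampled group."""
    n = len(answers)
    counts = {}
    for a in answers:
        counts[a] = counts.get(a, 0) + 1
    if len(counts) <= 1:
        return 0.0  # every sample agrees, so there is no vote uncertainty
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)  # log(n) is the entropy when all answers differ


def adaptive_clip(reward_std, vote_ent, kl, base_eps=0.2,
                  a=0.5, b=0.5, c=10.0, eps_min=0.05, eps_max=0.4):
    """Hypothetical rule: widen the clip range when rewards are dispersed or
    votes disagree, and shrink it when the step-wise KL drift grows."""
    eps = base_eps * (1.0 + a * reward_std + b * vote_ent) / (1.0 + c * kl)
    return min(eps_max, max(eps_min, eps))


@dataclass
class TemperatureController:
    """Bidirectional temperature: heat above or cool below base_temp according
    to uncertainty centered on a running (exponential moving average) baseline."""
    base_temp: float = 1.0
    gain: float = 0.5        # assumed sensitivity to centered uncertainty
    momentum: float = 0.9    # assumed EMA momentum for the baseline
    t_min: float = 0.5
    t_max: float = 1.5
    baseline: Optional[float] = None

    def step(self, uncertainty):
        if self.baseline is None:
            self.baseline = uncertainty  # first call: centered uncertainty is 0
        centered = uncertainty - self.baseline
        temp = self.base_temp + self.gain * centered
        temp = min(self.t_max, max(self.t_min, temp))
        # Update the baseline after it has been used, so the current batch
        # is compared against past batches only.
        self.baseline = (self.momentum * self.baseline
                         + (1.0 - self.momentum) * uncertainty)
        return temp
```

In a training loop, one would compute `vote_entropy` and the reward standard deviation from each probe group, pass them to `adaptive_clip` to obtain the clip range for the next policy update, and call `TemperatureController.step` with the same uncertainty signal to pick the sampling temperature for the next rollout.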

Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Mathematical Reasoning	CMath	Accuracy	77.6	63
Code Generation	CodeContests	Accuracy	17.3	30
STEM Reasoning	MMLU STEM	Accuracy (STEM)	58.1	15
STEM Reasoning	OCW	Accuracy	17.3	7
STEM Reasoning	SAT	Accuracy	0.893	7
STEM Reasoning	GaokaoCloze	Accuracy	25.9	7
STEM Reasoning	GaokaoQA	Accuracy	41.1	7

