AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback
About
Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.
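The bidirectional temperature controller described above can be illustrated with a minimal sketch. The abstract does not give the update rule, so everything here is an assumption made for illustration: the names `vote_entropy` and `TemperatureController`, the linear gain, the exponential-moving-average baseline, and all default values are hypothetical and are not taken from the AGPO implementation. The sketch uses normalized probe vote entropy as the uncertainty signal and heats or cools decoding around a base temperature depending on whether that signal sits above or below its running baseline.

```python
import math
from collections import Counter


def vote_entropy(answers):
    """Normalized Shannon entropy of probe answers (0 = unanimous, 1 = all distinct)."""
    n = len(answers)
    counts = Counter(answers)
    if n <= 1 or len(counts) == 1:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)  # divide by the maximum entropy for n votes


class TemperatureController:
    """Hypothetical controller: temperature moves around base_temp with centered uncertainty."""

    def __init__(self, base_temp=1.0, gain=0.5, momentum=0.9, t_min=0.5, t_max=1.5):
        self.base_temp = base_temp
        self.gain = gain          # assumed sensitivity to centered uncertainty
        self.momentum = momentum  # assumed EMA factor for the running baseline
        self.t_min = t_min
        self.t_max = t_max
        self.baseline = None      # initialized from the first observation

    def step(self, uncertainty):
        if self.baseline is None:
            self.baseline = uncertainty
        # Center the signal: positive means more uncertain than usual (heat),
        # negative means less uncertain than usual (cool).
        centered = uncertainty - self.baseline
        temp = self.base_temp * (1.0 + self.gain * centered)
        temp = min(max(temp, self.t_min), self.t_max)
        # Update the running baseline after computing the temperature.
        self.baseline = self.momentum * self.baseline + (1.0 - self.momentum) * uncertainty
        return temp


controller = TemperatureController()
for probe in (["4", "4", "5", "4"], ["1", "2", "3", "4"], ["7", "7", "7", "7"]):
    print(round(controller.step(vote_entropy(probe)), 3))
```

Centering on a running baseline, rather than using raw entropy, is what makes the adjustment bidirectional: a prompt that is merely as uncertain as recent ones leaves the temperature at its base value. The adaptive clipping controller would consume the same statistics alongside reward dispersion, skewness, policy entropy, and KL drift, but its functional form is not stated in the abstract and is not sketched here.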
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | CMath | Accuracy | 77.6 | 63 |
| Code Generation | CodeContests | Accuracy | 17.3 | 30 |
| STEM Reasoning | MMLU STEM | Accuracy (STEM) | 58.1 | 15 |
| STEM Reasoning | OCW | Accuracy | 17.3 | 7 |
| STEM Reasoning | SAT | Accuracy | 0.893 | 7 |
| STEM Reasoning | GaokaoCloze | Accuracy | 25.9 | 7 |
| STEM Reasoning | GaokaoQA | Accuracy | 41.1 | 7 |