Hölder Policy Optimisation
About
Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence, and relying on a fixed aggregation mechanism for this step limits the algorithm's adaptability. Empirically, we observe a critical trade-off: some fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework that unifies token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter $p$, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger $p$ concentrates the gradient to amplify sparse learning signals, whereas a smaller $p$ strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules $p$ across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of $54.9\%$ across multiple mathematical benchmarks, a $7.2\%$ relative gain over standard GRPO, and a $93.8\%$ success rate on ALFWorld.
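To make the aggregation concrete, here is a minimal sketch of the Hölder (power) mean applied to a list of token probabilities, together with an illustrative schedule for $p$. It is not the authors' implementation: the function names `holder_mean` and `linear_anneal`, the linear shape of the schedule, and its endpoints are all assumptions chosen for illustration, since the abstract does not specify them. The only facts relied on are standard properties of the power mean: $p = 1$ gives the arithmetic mean, the limit $p \to 0$ gives the geometric mean, and the mean is non-decreasing in $p$.

```python
import math


def holder_mean(probs, p):
    """Power (Hölder) mean of strictly positive values with exponent p.

    M_p(x) = ((1/n) * sum(x_i ** p)) ** (1/p); the limit p -> 0 is the
    geometric mean, handled explicitly to avoid dividing by zero.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    if any(x <= 0.0 for x in probs):
        raise ValueError("all probabilities must be strictly positive")
    n = len(probs)
    if abs(p) < 1e-12:
        # Geometric mean, computed in log space.
        return math.exp(sum(math.log(x) for x in probs) / n)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


def linear_anneal(step, total_steps, p_start, p_end):
    """Hypothetical schedule: interpolate p linearly from p_start to p_end.

    The endpoints and the linear shape are placeholders for illustration.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


if __name__ == "__main__":
    token_probs = [0.9, 0.1]  # toy per-token probabilities for one sequence
    for p in (-1.0, 0.0, 1.0, 2.0):
        print(p, holder_mean(token_probs, p))
```

Because the power mean is non-decreasing in $p$, larger exponents weight the high-probability tokens more heavily, while smaller exponents pull the aggregate toward the lowest-probability tokens. A single scalar can therefore interpolate between different fixed aggregation rules.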
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH500 (test) | -- | -- | 895 |
| Mathematical Reasoning | MATH 500 | Top-1 Accuracy | 92.6 | 384 |
| Mathematical Reasoning | Minerva | Pass@1 Accuracy | 42.3 | 289 |
| Mathematical Reasoning | AMC | Pass@1 Accuracy | 79.5 | 119 |
| Mathematical Reasoning | AIME 25 | -- | -- | 112 |
| Mathematical Reasoning | AIME 24 | Pass@1 Accuracy | 53.3 | 103 |
| Mathematical Reasoning | AMC23 (test) | Pass@1 | 60.2 | 61 |
| Mathematical Reasoning | Minerva (test) | -- | -- | 46 |
| Mathematical Reasoning | Olympiad | Pass@1 Accuracy | 50.3 | 35 |
| Mathematical Reasoning | AIME25 (test) | -- | -- | 33 |