A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation
About
Decoupled PPO has been a successful reinforcement learning (RL) algorithm for dealing with high data staleness in the asynchronous RL setting. Its decoupled loss improves on the learning stability of coupled-loss algorithms (e.g., standard PPO, GRPO) by introducing a proximal policy that decouples the off-policy correction (importance weight) from the policy update constraint (trust region). However, the proximal policy requires an extra forward pass through the model at each training step, which creates computational overhead when training large language models. We observe that since the proximal policy only serves as a trust region anchor between the behavior and target policies, we can approximate it through simple interpolation without explicit computation. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates this overhead, accelerating training by 1.8x while maintaining comparable performance. Code and an off-the-shelf example are contributed to the open-source RL training system AReaL at: https://github.com/inclusionAI/AReaL/blob/v1.0.0.rc1/docs/algorithms/prox_approx.md
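To make the idea concrete, the following is a minimal sketch of a decoupled PPO loss in which the proximal log-probabilities are interpolated rather than computed by a forward pass. It is not the AReaL implementation: the function names, the choice to interpolate in log-probability space, and the weight `alpha` are all assumptions made for illustration. The abstract does not state how the interpolation weight depends on staleness, so `alpha` is left as a plain parameter here.

```python
import math


def approx_prox_logp(logp_behav, logp_target, alpha):
    """Hypothetical approximation of the proximal policy's log-probabilities.

    Interpolates per token between the behavior and target log-probabilities.
    alpha is the weight on the behavior policy (assumed to lie in [0, 1]).
    """
    return [alpha * b + (1.0 - alpha) * t for b, t in zip(logp_behav, logp_target)]


def decoupled_ppo_loss(logp_target, logp_behav, logp_prox, advantages, eps=0.2):
    """Decoupled PPO clipped surrogate loss, averaged over tokens.

    In an autograd framework, logp_prox and logp_behav would be treated as
    constants (no gradient flows through them); this sketch only computes
    the loss value.
    """
    total = 0.0
    for lt, lb, lp, adv in zip(logp_target, logp_behav, logp_prox, advantages):
        # Off-policy correction: importance weight of proximal vs. behavior policy.
        importance_weight = math.exp(lp - lb)
        # Trust region: ratio of target vs. proximal policy, clipped around 1.
        ratio = math.exp(lt - lp)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += importance_weight * min(ratio * adv, clipped * adv)
    return -total / len(advantages)


# Toy per-token data (illustrative values only).
logp_behav = [math.log(p) for p in (0.2, 0.5, 0.1)]
logp_target = [math.log(p) for p in (0.3, 0.4, 0.2)]
advantages = [1.0, -1.0, 0.5]

logp_prox = approx_prox_logp(logp_behav, logp_target, alpha=0.5)
loss = decoupled_ppo_loss(logp_target, logp_behav, logp_prox, advantages)
```

Under these assumptions the two endpoints behave as one would expect: with `alpha=1.0` the proximal policy coincides with the behavior policy and the loss reduces to the standard PPO clipped objective, while with `alpha=0.0` it coincides with the target policy, so the ratio is 1 and only the importance weight remains.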
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | AIME 2024 (test) | -- | 209 |
| Agentic Task | VitaBench In-store | Avg@2 31.47 | 8 |
| Agentic Task | τ2-bench Retail | Avg@4 65.8 | 8 |
| Agentic Task | τ2-bench Telecom | Avg@2 Score 44 | 8 |
| Agentic Task | τ2-bench Airline | Avg@4 54 | 8 |
| Agentic Task | VitaBench Delivery | Avg@2 20.74 | 8 |
| Mathematical Reasoning | DAPO-Math-17k (test) | Final Eval Reward 0.623 | 3 |
| Mathematical Reasoning | GSM8K (test) | Final Reward 0.791 | 3 |