Generalized Per-Agent Advantage Estimation for Multi-Agent Policy Optimization
About
In this paper, we propose a novel framework for multi-agent reinforcement learning that enhances sample efficiency and coordination through accurate per-agent advantage estimation. The core of our approach is the Generalized Per-Agent Advantage Estimator (GPAE), which employs a per-agent value iteration operator to compute precise per-agent advantages. This operator enables stable off-policy learning by estimating values indirectly via action probabilities, eliminating the need for direct Q-function estimation. To further refine estimation, we introduce a double-truncated importance sampling ratio scheme, which improves credit assignment on off-policy trajectories by balancing sensitivity to the agent's own policy changes against robustness to non-stationarity induced by other agents. Experiments on standard benchmarks demonstrate that our approach outperforms existing methods, excelling in coordination and sample efficiency in complex scenarios.
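To make the double-truncated scheme concrete, the sketch below shows one plausible reading of it: a GAE-style per-agent advantage recursion in which the agent's own importance ratio is clipped loosely (retaining sensitivity to its own policy change) while the joint ratio of the other agents is clipped tightly (suppressing non-stationarity). The function names, clip ranges, and the exact placement of the ratios are illustrative assumptions, not the paper's precise operator.

```python
import numpy as np

def per_agent_advantages(rewards, values, rho_own, rho_others,
                         gamma=0.99, lam=0.95,
                         clip_own=(0.8, 1.2), clip_others=(0.95, 1.05)):
    """Hedged sketch of a double-truncated, GAE-style per-agent advantage.

    rewards:    length-T array of rewards.
    values:     length-(T+1) array of value estimates (bootstrap value last).
    rho_own:    length-T importance ratios pi_i_new / pi_i_old for agent i.
    rho_others: length-T joint importance ratios for all other agents.
    clip_own / clip_others: truncation intervals; the "others" interval is
        tighter, so drift in teammates' policies perturbs the estimate less.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    # Double truncation: loose clip on the agent's own ratio,
    # tight clip on the other agents' joint ratio.
    rho_i = np.clip(rho_own, *clip_own)
    rho_j = np.clip(rho_others, *clip_others)

    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error for agent i's value estimate.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Off-policy-corrected GAE recursion with the truncated ratios.
        gae = rho_i[t] * rho_j[t] * (delta + gamma * lam * gae)
        adv[t] = gae
    return adv
```

When all ratios equal 1 (on-policy data), both truncations are inactive and the recursion reduces to standard GAE, which is a useful sanity check for any implementation.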
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 5m vs 6m | SMAC | Win Rate | 93.7 | 13 |
| 6h vs 8z | SMAC | Win Rate | 99.8 | 12 |
| 3s5z vs 3s6z | SMAC | Win Rate | 87.3 | 12 |
| 10m vs 11m | SMAC | Win Rate | 98.5 | 12 |
| smacv2_10_units | SMAX (SMACv2) | Average Win Rate | 75 | 7 |
| smacv2_5_units | SMAX (SMACv2) | Average Win Rate | 81 | 7 |
| ant-4x2 | MABrax | Episode Return | 3.57e+3 | 5 |
| ant-8x1 | MABrax | Episode Return | 3.29e+3 | 5 |
| halfcheetah-6x1 | MABrax | Episode Return | 3.46e+3 | 5 |
| hopper-3x1 | MABrax | Episode Return | 1.57e+3 | 5 |