Generalized Per-Agent Advantage Estimation for Multi-Agent Policy Optimization
About
In this paper, we propose a novel framework for multi-agent reinforcement learning that enhances sample efficiency and coordination through accurate per-agent advantage estimation. The core of our approach is the Generalized Per-Agent Advantage Estimator (GPAE), which employs a per-agent value iteration operator to compute precise per-agent advantages. This operator enables stable off-policy learning by estimating values indirectly via action probabilities, eliminating the need for direct Q-function estimation. To further refine estimation, we introduce a double-truncated importance sampling ratio scheme, which improves credit assignment on off-policy trajectories by balancing sensitivity to the agent's own policy changes against robustness to non-stationarity induced by other agents. Experiments on standard benchmarks demonstrate that our approach outperforms existing methods, excelling in coordination and sample efficiency in complex scenarios.
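To make the double-truncated scheme concrete, the sketch below shows one plausible reading of it: a GAE-style per-agent advantage recursion in which the agent's own importance ratio is clipped loosely (retaining sensitivity to its own policy change) while the joint ratio of the other agents is clipped tightly (suppressing non-stationarity). The function names, clip ranges, and the exact placement of the ratios are illustrative assumptions, not the paper's precise operator.

```python
import numpy as np

def per_agent_advantages(rewards, values, rho_own, rho_others,
                         gamma=0.99, lam=0.95,
                         clip_own=(0.8, 1.2), clip_others=(0.95, 1.05)):
    """Hedged sketch of a double-truncated, GAE-style per-agent advantage.

    rewards:    length-T array of rewards.
    values:     length-(T+1) array of value estimates (bootstrap value last).
    rho_own:    length-T importance ratios pi_i_new / pi_i_old for agent i.
    rho_others: length-T joint importance ratios for all other agents.
    clip_own / clip_others: truncation intervals; the "others" interval is
        tighter, so drift in teammates' policies perturbs the estimate less.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    # Double truncation: loose clip on the agent's own ratio,
    # tight clip on the other agents' joint ratio.
    rho_i = np.clip(rho_own, *clip_own)
    rho_j = np.clip(rho_others, *clip_others)

    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error for agent i's value estimate.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Off-policy-corrected GAE recursion with the truncated ratios.
        gae = rho_i[t] * rho_j[t] * (delta + gamma * lam * gae)
        adv[t] = gae
    return adv
```

When all ratios equal 1 (on-policy data), both truncations are inactive and the recursion reduces to standard GAE, which is a useful sanity check for any implementation.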
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 5m vs 6m | SMAC | Win Rate | 93.7 | 13 |
| 6h vs 8z | SMAC | Win Rate | 99.8 | 12 |
| 3s5z vs 3s6z | SMAC | Win Rate | 87.3 | 12 |
| 10m vs 11m | SMAC | Win Rate | 98.5 | 12 |
| smacv2_10_units | SMAX (SMACv2) | Average Win Rate | 75 | 7 |
| smacv2_5_units | SMAX (SMACv2) | Average Win Rate | 81 | 7 |
| ant-4x2 | MABrax | Episode Return | 3.57e+3 | 5 |
| ant-8x1 | MABrax | Episode Return | 3.29e+3 | 5 |
| halfcheetah-6x1 | MABrax | Episode Return | 3.46e+3 | 5 |
| hopper-3x1 | MABrax | Episode Return | 1.57e+3 | 5 |