Bilevel Optimization over Saddle Points of Zero-Sum Markov Games

About

Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact. We study bilevel optimization problems in which the LL problem is a regularized min-max zero-sum Markov game and the UL objective is optimized through the saddle-point equilibrium induced by the LL game. In this work, we propose penalty-augmented Nikaido-Isoda descent-ascent (PANDA), a penalty-based first-order policy-gradient method based on the Nikaido-Isoda function. By exploiting the min-max game structure, PANDA avoids computing UL hypergradients and does not require second-order information. We prove that PANDA converges to stationary points without convexity assumptions on either the UL or LL objectives. Moreover, PANDA reaches an $\epsilon$-stationary point in $\tilde{\mathcal{O}}(\epsilon^{-1})$ iterations with sample complexity $\tilde{\mathcal{O}}(\epsilon^{-3})$, matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments demonstrate the superior performance of PANDA over closely related baselines.

Zihao Zheng, Irwin King, Songtao Lu• 2026

Related benchmarks

Task	Dataset	Result	Rank
Bilevel Reinforcement Learning	Bilevel Reinforcement Learning LL Problem: Max	--		6
Bilevel Reinforcement Learning	Bilevel Optimization over Saddle Points LL Problem: Min-Max	Iteration Complexity1		3

Showing 2 of 2 rows

Other info

Follow for update

@wizwand_team Discord