Monotone and Conservative Policy Iteration Beyond the Tabular Case

About

We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI) that retain tabular guarantees under function approximation. RPI uses a novel Bellman-constrained optimization for policy evaluation. We show that RPI restores the textbook monotonicity of value estimates and that these estimates provably lower-bound the true return; moreover, their limit partially satisfies the unprojected Bellman equation. CRPI shares RPI's evaluation, but updates policies conservatively by maximizing a new performance-difference lower bound that explicitly accounts for function-approximation-induced errors. CRPI inherits RPI's guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants. Our work addresses a foundational gap in RL: popular algorithms such as TRPO and PPO derive from tabular CPI yet are deployed with function approximation, where CPI's guarantees often fail, leading to divergence, oscillations, or convergence to suboptimal policies. By restoring PI/CPI-style guarantees for arbitrary function classes, RPI and CRPI provide a principled basis for next-generation RL.
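
For context, the "textbook monotonicity" mentioned above can be illustrated with a minimal sketch of plain tabular Policy Iteration using exact policy evaluation. This is not RPI or CRPI, and it does not reproduce the paper's Bellman-constrained evaluation; the helper names (`random_mdp`, `evaluate_policy`, `policy_iteration`) and the small random MDP are assumptions made purely for illustration.

```python
import numpy as np

def random_mdp(n_states, n_actions, seed=0):
    """Hypothetical helper: a random MDP with P of shape (S, A, S) and R of shape (S, A)."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)  # normalize each (s, a) row into a distribution
    R = rng.random((n_states, n_actions))
    return P, R

def evaluate_policy(P, R, policy, gamma):
    """Exact tabular evaluation: solve (I - gamma * P_pi) v = r_pi."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, policy]  # (S, S) transitions under the policy
    r_pi = R[idx, policy]  # (S,) rewards under the policy
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def policy_iteration(P, R, gamma=0.9, max_iters=100):
    """Tabular PI; returns the final policy and the value vector of each iterate."""
    n_states, _ = R.shape
    policy = np.zeros(n_states, dtype=int)
    values = []
    for _ in range(max_iters):
        v = evaluate_policy(P, R, policy, gamma)
        values.append(v)
        q = R + gamma * P @ v          # (S, A) action values of the current policy
        new_policy = q.argmax(axis=1)  # greedy improvement step
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, values

if __name__ == "__main__":
    P, R = random_mdp(n_states=6, n_actions=3, seed=0)
    policy, values = policy_iteration(P, R)
    for k, v in enumerate(values):
        print(k, np.round(v, 3))
```

In this exact tabular setting, each successive value vector dominates the previous one in every state. That is the property the abstract says can fail once evaluation is done with function approximation.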

S.R. Eshwar, Gugan Thoppe, Ananyabrata Barua, Aditya Gopalan, Gal Dalal • 2025

Related benchmarks

Task	Dataset	Result
Reinforcement Learning	Walker2D v5	Average Return 319.9	45
Continuous Control	Walker2D v5	Avg Return 319.9	17
Continuous Control	Swimmer v5	Terminal Performance 22.3	2
Continuous Control	Hopper v5	Terminal Performance 270.9	2
Continuous Control	Halfcheetah v5	Terminal Performance 632	2
Continuous Control	Ant v5	Terminal Performance 208.2	2
Reinforcement Learning	Swimmer	Terminal Performance 22.3	2
Reinforcement Learning	Hopper v5	Terminal Performance 270.9	2
Reinforcement Learning	Halfcheetah	Terminal Performance 632	2
Reinforcement Learning	Ant v5	Terminal Performance 208.2	2
