
Monotone and Conservative Policy Iteration Beyond the Tabular Case

About

We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI) that retain tabular guarantees under function approximation. RPI uses a novel Bellman-constrained optimization for policy evaluation. We show that RPI restores the textbook monotonicity of value estimates and that these estimates provably lower-bound the true return; moreover, their limit partially satisfies the unprojected Bellman equation. CRPI shares RPI's evaluation, but updates policies conservatively by maximizing a new performance-difference lower bound that explicitly accounts for function-approximation-induced errors. CRPI inherits RPI's guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants. Our work addresses a foundational gap in RL: popular algorithms such as TRPO and PPO derive from tabular CPI yet are deployed with function approximation, where CPI's guarantees often fail, leading to divergence, oscillations, or convergence to suboptimal policies. By restoring PI/CPI-style guarantees for arbitrary function classes, RPI and CRPI provide a principled basis for next-generation RL.
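
For orientation, the tabular guarantee that the abstract refers to can be illustrated with a minimal sketch. This is the textbook baseline only, not RPI or CRPI: it assumes a small, made-up 3-state, 2-action MDP (the arrays `P` and `R` are hypothetical), exact policy evaluation, and either a greedy update (`alpha=1.0`, standard PI) or a CPI-style mixture update (`alpha<1`). Under these assumptions the value of every state is non-decreasing across iterations, which is the monotonicity property that can be lost once function approximation is introduced.

```python
# Textbook tabular baseline (NOT the RPI/CRPI method from the abstract):
# exact policy evaluation plus a greedy or conservative (mixture) update.

GAMMA = 0.9  # discount factor (assumed for illustration)

# Hypothetical toy MDP: P[s][a] is the next-state distribution, R[s][a] the reward.
P = [
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.0, 0.9, 0.1], [0.5, 0.0, 0.5]],
    [[0.3, 0.3, 0.4], [0.0, 0.1, 0.9]],
]
R = [[0.0, 1.0], [0.5, 0.0], [0.2, 2.0]]
N_STATES, N_ACTIONS = 3, 2


def q_values(v):
    """One-step lookahead: Q(s, a) = R(s, a) + gamma * sum_t P(t | s, a) * v(t)."""
    return [[R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(N_STATES))
             for a in range(N_ACTIONS)] for s in range(N_STATES)]


def evaluate(pi, sweeps=2000):
    """Policy evaluation by repeated Bellman backups (converges as gamma < 1)."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        q = q_values(v)
        v = [sum(pi[s][a] * q[s][a] for a in range(N_ACTIONS))
             for s in range(N_STATES)]
    return v


def improve(pi, v, alpha=1.0):
    """Mix the current policy with the greedy one: (1 - alpha) * pi + alpha * greedy."""
    q = q_values(v)
    new_pi = []
    for s in range(N_STATES):
        best = max(range(N_ACTIONS), key=lambda a: q[s][a])
        new_pi.append([(1 - alpha) * pi[s][a] + alpha * (1.0 if a == best else 0.0)
                       for a in range(N_ACTIONS)])
    return new_pi


def policy_iteration(alpha=1.0, iters=30):
    """Return the final policy and the value estimates recorded at each iteration."""
    pi = [[1.0 / N_ACTIONS] * N_ACTIONS for _ in range(N_STATES)]  # uniform start
    history = []
    for _ in range(iters):
        v = evaluate(pi)
        history.append(v)
        pi = improve(pi, v, alpha)
    return pi, history


if __name__ == "__main__":
    for alpha in (1.0, 0.3):
        _, history = policy_iteration(alpha)
        print(f"alpha={alpha}: first V={history[0]}, last V={history[-1]}")
```

With `alpha=1.0` the update is plain greedy improvement; a smaller `alpha` moves the policy only part of the way toward the greedy one, which is the conservative step that CPI analyzes. Both keep the value sequence monotone here only because evaluation is exact and the policy class is unrestricted.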

S.R. Eshwar, Gugan Thoppe, Ananyabrata Barua, Aditya Gopalan, Gal Dalal • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reinforcement Learning | Walker2D v5 | Average Return: 319.9 | 45 |
| Continuous Control | Walker2D v5 | Avg Return: 319.9 | 17 |
| Continuous Control | Swimmer v5 | Terminal Performance: 22.3 | 2 |
| Continuous Control | Hopper v5 | Terminal Performance: 270.9 | 2 |
| Continuous Control | Halfcheetah v5 | Terminal Performance: 632 | 2 |
| Continuous Control | Ant v5 | Terminal Performance: 208.2 | 2 |
| Reinforcement Learning | Swimmer | Terminal Performance: 22.3 | 2 |
| Reinforcement Learning | Hopper v5 | Terminal Performance: 270.9 | 2 |
| Reinforcement Learning | Halfcheetah | Terminal Performance: 632 | 2 |
| Reinforcement Learning | Ant v5 | Terminal Performance: 208.2 | 2 |
