Monotone and Conservative Policy Iteration Beyond the Tabular Case
About
We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI) that retain tabular guarantees under function approximation. RPI uses a novel Bellman-constrained optimization for policy evaluation. We show that RPI restores the textbook monotonicity of value estimates and that these estimates provably lower-bound the true return; moreover, their limit partially satisfies the unprojected Bellman equation. CRPI shares RPI's evaluation step but updates policies conservatively by maximizing a new performance-difference lower bound that explicitly accounts for errors induced by function approximation. CRPI inherits RPI's guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants. Our work addresses a foundational gap in RL: popular algorithms such as TRPO and PPO derive from tabular CPI yet are deployed with function approximation, where CPI's guarantees often fail, leading to divergence, oscillations, or convergence to suboptimal policies. By restoring PI/CPI-style guarantees for arbitrary function classes, RPI and CRPI provide a principled basis for next-generation RL.
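For readers unfamiliar with the tabular guarantee the abstract refers to, the following is a minimal sketch of textbook PI and a CPI-style mixture update on a toy MDP. It is not the RPI or CRPI algorithm: the Bellman-constrained evaluation is not specified in the abstract and is not reproduced here. The MDP `P`, the discount `GAMMA`, and all function names are hypothetical and chosen only for illustration. With `alpha=1.0` the update is the greedy PI step; with `alpha<1` the new policy is a mixture of the old policy and the greedy one, as in CPI.

```python
# Toy MDP (hypothetical): 3 states, 2 actions.
# P[s][a] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.5)], 1: [(1.0, 0, 0.0)]},
}


def q_value(v, s, a):
    """One-step lookahead value of taking action a in state s under values v."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a])


def evaluate(policy, sweeps=2000):
    """Tabular policy evaluation by repeated Bellman backups (no approximation)."""
    v = [0.0] * len(P)
    for _ in range(sweeps):
        v = [sum(policy[s][a] * q_value(v, s, a) for a in P[s]) for s in P]
    return v


def improve(policy, v, alpha=1.0):
    """Move each state's action distribution a fraction alpha toward the greedy action."""
    new_policy = []
    for s in P:
        best = max(P[s], key=lambda a: q_value(v, s, a))
        new_policy.append(
            [(1 - alpha) * policy[s][a] + alpha * (a == best) for a in P[s]]
        )
    return new_policy


def run(alpha, iterations=20):
    """Alternate evaluation and improvement; return the value estimates per iteration."""
    policy = [[0.5, 0.5] for _ in P]  # start from the uniform policy
    history = []
    for _ in range(iterations):
        v = evaluate(policy)
        history.append(v)
        policy = improve(policy, v, alpha)
    return history


def is_monotone(history, tol=1e-9):
    """True if no state's value decreases from one iteration to the next."""
    return all(
        b[s] >= a[s] - tol
        for a, b in zip(history, history[1:])
        for s in range(len(a))
    )


if __name__ == "__main__":
    print("PI monotone:", is_monotone(run(alpha=1.0)))
    print("CPI-style monotone:", is_monotone(run(alpha=0.3)))
```

In this exact tabular setting the value of every state is non-decreasing across iterations for any `alpha` in [0, 1]. This is the monotonicity property that, according to the abstract, is lost under function approximation and that RPI is designed to restore.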
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Reinforcement Learning | Walker2D v5 | Average Return | 319.9 | 45 |
| Continuous Control | Walker2D v5 | Avg Return | 319.9 | 17 |
| Continuous Control | Swimmer v5 | Terminal Performance | 22.3 | 2 |
| Continuous Control | Hopper v5 | Terminal Performance | 270.9 | 2 |
| Continuous Control | Halfcheetah v5 | Terminal Performance | 632 | 2 |
| Continuous Control | Ant v5 | Terminal Performance | 208.2 | 2 |
| Reinforcement Learning | Swimmer | Terminal Performance | 22.3 | 2 |
| Reinforcement Learning | Hopper v5 | Terminal Performance | 270.9 | 2 |
| Reinforcement Learning | Halfcheetah | Terminal Performance | 632 | 2 |
| Reinforcement Learning | Ant v5 | Terminal Performance | 208.2 | 2 |