Action-Gradient Monte Carlo Tree Search for Non-Parametric Continuous (PO)MDPs
About
Online planning in continuous state, action, and observation spaces remains challenging for autonomous systems. While Monte Carlo Tree Search (MCTS) scales effectively via sampling, most continuous (PO)MDP solvers do not exploit gradient-based action optimization. We propose Action-Gradient MCTS (AGMCTS), a framework that combines global tree search with local gradient-based action refinement, while maintaining consistent value estimates. We provide three key theoretical contributions: (1) an action score gradient theorem for particle belief states; (2) the Multiple Importance Sampling (MIS) Tree that supports frequent action-branch updates by reusing prior samples without introducing estimator drift; and (3) tractable action score gradients for smooth generative models using the Area Formula. Empirical results demonstrate that AGMCTS outperforms state-of-the-art sample-based solvers in multiple challenging continuous MDP and POMDP benchmarks.
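The sample-reuse idea behind contribution (2) can be illustrated with a generic multiple importance sampling estimator. The sketch below is not the paper's MIS Tree; it is a minimal one-dimensional example that assumes unit-variance Gaussian proposals and uses the standard balance heuristic. The names `normal_pdf`, `mis_estimate`, and `demo`, and the proposal means, are hypothetical and chosen only for illustration. It shows how samples drawn under several earlier distributions (for example, earlier values of an action) can be reweighted to estimate an expectation under a new target distribution without drawing fresh samples.

```python
import math
import random


def normal_pdf(x, mean, std=1.0):
    """Density of a Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mis_estimate(samples_by_proposal, proposal_means, target_mean, f):
    """Balance-heuristic MIS estimate of E[f(x)] for x ~ N(target_mean, 1).

    samples_by_proposal[k] holds samples drawn from N(proposal_means[k], 1).
    Each sample is weighted by target density / mixture density, where the
    mixture combines all proposals in proportion to their sample counts.
    """
    total = sum(len(s) for s in samples_by_proposal)
    estimate = 0.0
    for samples in samples_by_proposal:
        for x in samples:
            mixture = sum(
                (len(s) / total) * normal_pdf(x, m)
                for s, m in zip(samples_by_proposal, proposal_means)
            )
            estimate += f(x) * normal_pdf(x, target_mean) / mixture
    return estimate / total


def demo(seed=0, n_per_proposal=20000):
    """Reuse samples from three old proposals to estimate a mean under a new target."""
    rng = random.Random(seed)
    proposal_means = [0.0, 0.3, 0.6]  # hypothetical earlier action values
    samples = [
        [rng.gauss(m, 1.0) for _ in range(n_per_proposal)]
        for m in proposal_means
    ]
    target_mean = 0.5  # hypothetical updated action value
    return mis_estimate(samples, proposal_means, target_mean, lambda x: x)


if __name__ == "__main__":
    print(demo())  # should be close to the true mean 0.5
```

Because every sample is divided by the mixture of all proposal densities, the estimator stays unbiased for the new target as long as the mixture covers the target's support, which is the property that makes reusing earlier samples attractive when an action is updated repeatedly.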
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Control Task | Lunar Lander (test) | Average Reward | 61.28 | 31 |
| Continuous Control | Mountain Car POMDP | Mean Performance | 26.96 | 30 |
| Hill Car POMDP | Hill Car POMDP | Mean Return | 87.58 | 30 |
| Two-Agent 2D-Continuous Light-Dark Navigation | Two-Agent 2D-Continuous Light-Dark | Mean Performance | 2.84 | 30 |
| POMDP Navigation | 4D-Continuous Light-Dark | Mean Return | 2.97 | 30 |
| Planning | 3D-Continuous Light-Dark | Mean Return | 4.17 | 30 |
| Reinforcement Learning | Lunar Lander POMDP | Performance Score | 52 | 30 |
| POMDP Planning | 2D-Continuous Light-Dark (test) | Mean Return | 5.07 | 30 |
| Mountain Car | Mountain Car | Mean Return | 29.97 | 20 |
| Reinforcement Learning | Hill Car MDP | Performance | 56.68 | 20 |