Learning Multi-Timescale Abstractions for Hierarchical Combinatorial Planning
About
The combination of exponentially large action spaces, stochastic dynamics, and long-horizon decision-making under limited resources makes Sequential Stochastic Combinatorial Optimization (SSCO) particularly challenging for reinforcement learning. Hierarchical Reinforcement Learning (HRL) offers a natural decomposition, but it places the high-level policy in a Semi-Markov Decision Process (SMDP) where actions have variable durations, making it difficult to learn a world model that is suitable for planning. We introduce a model-based hierarchical framework for sequential stochastic combinatorial decision-making that directly addresses this issue. Our method combines a latent-space tree-search planner with an SMDP-aware world model for variable-duration decisions. A multi-timescale objective structures the latent dynamics so that transition magnitudes reflect the effective temporal scales of abstract actions, enabling efficient lookahead under adaptive temporal abstraction. We further learn a subgoal-conditioned budget policy jointly with the world model to support context-aware resource allocation. Across challenging SSCO benchmarks, our method outperforms strong baselines.
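To make the SMDP point concrete, here is a minimal sketch of how discounting works when high-level actions last a variable number of primitive steps. It illustrates the standard SMDP backup only, not the paper's world model or planner. The function names `smdp_return` and `smdp_backup` and the discount `gamma` are hypothetical and chosen for illustration.

```python
from typing import Sequence


def smdp_return(segments: Sequence[Sequence[float]], gamma: float) -> float:
    """Discounted return over a sequence of variable-duration abstract actions.

    Each segment holds the per-step rewards collected while one abstract
    action was executing; the segment length is that action's duration.
    (Illustrative helper, not taken from the paper.)
    """
    total = 0.0
    discount = 1.0  # gamma ** (number of primitive steps elapsed so far)
    for rewards in segments:
        for r in rewards:
            total += discount * r
            discount *= gamma
    return total


def smdp_backup(rewards: Sequence[float], gamma: float, next_value: float) -> float:
    """One-step SMDP target for an abstract action of duration tau = len(rewards).

    The reward accumulated during the action is discounted step by step, and
    the value of the successor state is discounted by gamma ** tau.
    """
    tau = len(rewards)
    accumulated = sum(gamma ** k * r for k, r in enumerate(rewards))
    return accumulated + gamma ** tau * next_value
```

Because the bootstrap term is scaled by `gamma ** tau`, two abstract actions that reach the same state after different durations produce different targets. A learned high-level model therefore has to account for duration as well as the next state.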
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Stochastic route planning | SOP (Stochastic Orienteering Problem), N = 500 nodes | Mean Reward | 31.6 | 20 |
| Adaptive Influence Maximization | AIM (N=500 graphs, T=10, K=50) (test) | Average Reward | 286.1 | 9 |
| Adaptive Influence Maximization | AIM (N=500 graphs, T=10, K=60) (test) | Average Reward | 303.5 | 9 |
| Adaptive Influence Maximization | AIM (N=500 graphs, T=10, K=70) (test) | Average Reward | 324.1 | 9 |
| Adaptive Influence Maximization | AIM (N=500 graphs, T=20, K=10) (test) | Average Reward | 161.7 | 9 |
| Influence Maximization | AIM N=1000 (test) | Average Reward | 416.1 | 9 |
| Influence Maximization | AIM N=1500 (test) | Average Reward | 540.8 | 9 |
| Influence Maximization | AIM N=2000 (test) | Average Reward | 605.5 | 9 |
| Influence Maximization | AIM N=2500 (test) | Average Reward | 654.3 | 9 |
| Planning | AIM N=500, K=70 | Performance | 324.1 | 5 |