Learning Multi-Timescale Abstractions for Hierarchical Combinatorial Planning

About

The combination of exponentially large action spaces, stochastic dynamics, and long-horizon decision-making under limited resources makes Sequential Stochastic Combinatorial Optimization (SSCO) particularly challenging for reinforcement learning. Hierarchical Reinforcement Learning (HRL) offers a natural decomposition, but it places the high-level policy in a Semi-Markov Decision Process (SMDP) where actions have variable durations, making it difficult to learn a world model that is suitable for planning. We introduce a model-based hierarchical framework for sequential stochastic combinatorial decision-making that directly addresses this issue. Our method combines a latent-space tree-search planner with an SMDP-aware world model for variable-duration decisions. A multi-timescale objective structures the latent dynamics so that transition magnitudes reflect the effective temporal scales of abstract actions, enabling efficient lookahead under adaptive temporal abstraction. We further learn a subgoal-conditioned budget policy jointly with the world model to support context-aware resource allocation. Across challenging SSCO benchmarks, our method outperforms strong baselines.

Vivienne Huiling Wang, Tinghuai Wang, Joni Pajarinen • 2026
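The abstract's point about variable-duration actions in an SMDP can be illustrated with a small sketch. It is not the authors' implementation; the function name `smdp_return` and the representation of a trajectory as per-action reward segments are assumptions made here for illustration. It uses only the standard SMDP convention that an abstract action lasting tau primitive steps discounts everything after it by gamma ** tau rather than by gamma.

```python
from typing import Sequence


def smdp_return(segments: Sequence[Sequence[float]], gamma: float) -> float:
    """Discounted return over a sequence of variable-duration abstract actions.

    Hypothetical helper: each segment holds the per-step rewards collected
    while one high-level action was executing, so its length is that
    action's duration tau.
    """
    total = 0.0
    discount = 1.0  # gamma ** (number of primitive steps elapsed so far)
    for rewards in segments:
        # Reward accumulated inside the segment, discounted step by step.
        segment_reward = sum(gamma ** k * r for k, r in enumerate(rewards))
        total += discount * segment_reward
        # The next decision point is discounted by gamma ** tau, not gamma.
        discount *= gamma ** len(rewards)
    return total


# Two abstract actions: the first lasts two steps, the second lasts one.
value = smdp_return([[1.0, 1.0], [1.0]], gamma=0.5)
```

Because the discount between consecutive high-level decisions depends on how long each action ran, a world model for the high-level policy has to predict durations as well as next states, which is the difficulty the abstract refers to.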

Related benchmarks

Task	Dataset	Result
Stochastic route planning	SOP (Stochastic Orienteering Problem) N = 500 nodes	Mean Reward: 31.6	20
Adaptive Influence Maximization	AIM (N=500 graphs, T=10, K=50) (test)	Average Reward: 286.1	9
Adaptive Influence Maximization	AIM (N=500 graphs, T=10, K=60) (test)	Average Reward: 303.5	9
Adaptive Influence Maximization	AIM (N=500 graphs, T=10, K=70) (test)	Average Reward: 324.1	9
Adaptive Influence Maximization	AIM (N=500 graphs, T=20, K=10) (test)	Average Reward: 161.7	9
Influence Maximization	AIM N=1000 (test)	Average Reward: 416.1	9
Influence Maximization	AIM N=1500 (test)	Average Reward: 540.8	9
Influence Maximization	AIM N=2000 (test)	Avg. Reward: 605.5	9
Influence Maximization	AIM N=2500 (test)	Avg. Reward: 654.3	9
Planning	AIM N=500, K=70	Performance: 324.1	5

Showing 10 of 20 rows
