
Learning Multi-Timescale Abstractions for Hierarchical Combinatorial Planning

About

The combination of exponentially large action spaces, stochastic dynamics, and long-horizon decision-making under limited resources makes Sequential Stochastic Combinatorial Optimization (SSCO) particularly challenging for reinforcement learning. Hierarchical Reinforcement Learning (HRL) offers a natural decomposition, but it places the high-level policy in a Semi-Markov Decision Process (SMDP) where actions have variable durations, making it difficult to learn a world model that is suitable for planning. We introduce a model-based hierarchical framework for sequential stochastic combinatorial decision-making that directly addresses this issue. Our method combines a latent-space tree-search planner with an SMDP-aware world model for variable-duration decisions. A multi-timescale objective structures the latent dynamics so that transition magnitudes reflect the effective temporal scales of abstract actions, enabling efficient lookahead under adaptive temporal abstraction. We further learn a subgoal-conditioned budget policy jointly with the world model to support context-aware resource allocation. Across challenging SSCO benchmarks, our method outperforms strong baselines.
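To illustrate why variable-duration actions complicate the high-level problem, here is a minimal sketch of the standard SMDP one-step backup, in which the bootstrap term is discounted by the number of primitive steps the abstract action lasted. It is not the authors' implementation; the function names `smdp_return` and `smdp_target` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def smdp_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted reward accumulated while one abstract action runs.

    `rewards` holds the primitive-step rewards collected during the
    action, so its length is the action's duration tau.
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def smdp_target(rewards: Sequence[float], gamma: float, next_value: float) -> float:
    """Standard SMDP backup target: R + gamma**tau * V(s').

    Unlike an ordinary MDP backup, the discount applied to the next
    state's value depends on the duration tau of the chosen action.
    """
    tau = len(rewards)
    return smdp_return(rewards, gamma) + (gamma ** tau) * next_value


if __name__ == "__main__":
    # An abstract action lasting two primitive steps, each with reward 1.0.
    print(smdp_target([1.0, 1.0], gamma=0.5, next_value=4.0))
```

Because the effective discount changes with the duration, two abstract actions leading to the same next state can have very different targets, which is the kind of temporal-scale variation a planning model for the high-level policy has to capture.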

Vivienne Huiling Wang, Tinghuai Wang, Joni Pajarinen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Stochastic route planning | SOP (Stochastic Orienteering Problem) N = 500 nodes | Mean Reward | 31.6 | 20 |
| Adaptive Influence Maximization | AIM (N=500 graphs, T=10, K=50) (test) | Average Reward | 286.1 | 9 |
| Adaptive Influence Maximization | AIM (N=500 graphs, T=10, K=60) (test) | Average Reward | 303.5 | 9 |
| Adaptive Influence Maximization | AIM (N=500 graphs, T=10, K=70) (test) | Average Reward | 324.1 | 9 |
| Adaptive Influence Maximization | AIM (N=500 graphs, T=20, K=10) (test) | Average Reward | 161.7 | 9 |
| Influence Maximization | AIM N=1000 (test) | Average Reward | 416.1 | 9 |
| Influence Maximization | AIM N=1500 (test) | Average Reward | 540.8 | 9 |
| Influence Maximization | AIM N=2000 (test) | Avg. Reward | 605.5 | 9 |
| Influence Maximization | AIM N=2500 (test) | Avg. Reward | 654.3 | 9 |
| Planning | AIM N=500, K=70 | Performance | 324.1 | 5 |
Showing 10 of 20 rows
