
Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction

About

Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges. We explore this challenge in the traditional offline RL setting, where an agent must learn how to make decisions based on data collected through a stochastic behavior policy. We present the Latent Macro Action Planner (L-MAP), which addresses this challenge by learning a set of temporally extended macro-actions through a state-conditional Vector Quantized Variational Autoencoder (VQ-VAE), effectively reducing action dimensionality. L-MAP employs a separate learned prior model that acts as a latent transition model and allows efficient sampling of plausible actions. During planning, our approach accounts for stochasticity in both the environment and the behavior policy by using Monte Carlo tree search (MCTS). In offline RL settings, including stochastic continuous control tasks, L-MAP efficiently searches over discrete latent actions to yield high expected returns. Empirical results demonstrate that L-MAP maintains low decision latency despite increased action dimensionality. Notably, across tasks ranging from continuous control with inherently stochastic dynamics to high-dimensional robotic hand manipulation, L-MAP significantly outperforms existing model-based methods and performs on par with strong model-free actor-critic baselines, highlighting the effectiveness of the proposed approach for planning in complex, stochastic environments with high-dimensional action spaces.
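The abstract describes the core mechanism: a state-conditional VQ-VAE maps short sequences of continuous actions (macro-actions) to discrete latent codes, and the planner then searches over those codes with MCTS. Below is a minimal, hedged sketch of the quantization step only, written in Python/NumPy; the toy linear encoder, the dimensions, and the random codebook are illustrative assumptions and not the authors' implementation.

```python
# Illustrative sketch of state-conditional vector quantization of a macro-action.
# All shapes, the linear "encoder", and the codebook initialization are assumptions.
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 11        # e.g. a Hopper-sized observation (assumption)
ACTION_DIM = 3        # e.g. a Hopper-sized action (assumption)
MACRO_LEN = 3         # primitive actions per macro-action (assumption)
LATENT_DIM = 16       # latent embedding size (assumption)
CODEBOOK_SIZE = 128   # number of discrete latent codes (assumption)

# Toy "encoder": a fixed random linear map from (state, macro-action) to a latent vector.
W_enc = rng.normal(size=(STATE_DIM + MACRO_LEN * ACTION_DIM, LATENT_DIM)) / 10.0
# Codebook of discrete latent macro-actions (random placeholders standing in for learned codes).
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

def quantize(state: np.ndarray, macro_action: np.ndarray) -> tuple[int, np.ndarray]:
    """Encode a (state, macro-action) pair and snap it to the nearest codebook entry.

    Returns the discrete code index (the kind of symbol a planner could branch over)
    and the quantized latent vector (what a decoder or latent transition model would consume).
    """
    z = np.concatenate([state, macro_action.ravel()]) @ W_enc
    dists = np.linalg.norm(codebook - z, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# Example: one state and one macro-action of MACRO_LEN primitive actions.
state = rng.normal(size=STATE_DIM)
macro_action = rng.uniform(-1.0, 1.0, size=(MACRO_LEN, ACTION_DIM))
code, z_q = quantize(state, macro_action)
print(f"discrete latent code: {code}, quantized latent shape: {z_q.shape}")
```

In the paper's setting, the discrete code index is what makes tree search tractable: instead of branching over a continuous action space, the planner branches over a finite set of codebook entries, with a learned prior proposing which codes are plausible in a given state.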

Baiting Luo, Ava Pettet, Aron Laszka, Abhishek Dubey, Ayan Mukhopadhyay • 2025

Related benchmarks

Task | Dataset | Result | Rank
Offline Reinforcement Learning | Hopper Medium | Normalized Score: 71.08 | 52
Offline Reinforcement Learning | Walker2d Medium | Normalized Score: 75.77 | 51
Offline Reinforcement Learning | Walker2d Medium-Replay | Normalized Score: 71.98 | 50
Offline Reinforcement Learning | Hopper Medium-Replay | Normalized Score: 83.99 | 44
Offline Reinforcement Learning | Walker2d Medium-Expert | Normalized Score: 96.53 | 31
Offline Reinforcement Learning | Hopper Medium-Expert | Normalized Score: 94.65 | 24
Offline Reinforcement Learning | Hopper Medium (Noise 0) | Normalized Return: 90.8 | 14
Offline Reinforcement Learning | Hopper Medium (Noise 5) | Normalized Return: 60.76 | 14
Offline Reinforcement Learning | Hopper Medium-Expert (Noise 5) | Normalized Return: 0.7149 | 7
Offline Reinforcement Learning | Hopper Medium-Expert (Noise 0) | Normalized Return: 106.7 | 7

(Showing 10 of 27 benchmark entries.)
