
Latent Plan Transformer for Trajectory Abstraction: Planning as Latent Space Inference

About

In tasks aiming for long-term returns, planning becomes essential. We study generative modeling for planning with datasets repurposed from offline reinforcement learning. Specifically, we identify temporal consistency in the absence of step-wise rewards as one key technical challenge. We introduce the Latent Plan Transformer (LPT), a novel model that leverages a latent variable to connect a Transformer-based trajectory generator and the final return. LPT can be learned with maximum likelihood estimation on trajectory-return pairs. In learning, posterior sampling of the latent variable naturally integrates sub-trajectories to form a consistent abstraction despite the finite context. At test time, the latent variable is inferred from an expected return before policy execution, realizing the idea of planning as inference. Our experiments demonstrate that LPT can discover improved decisions from sub-optimal trajectories, achieving competitive performance across several benchmarks, including Gym-Mujoco, Franka Kitchen, Maze2D, and Connect Four. It exhibits capabilities in nuanced credit assignment, trajectory stitching, and adaptation to environmental contingencies. These results validate that latent variable inference can be a strong alternative to step-wise reward prompting.
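The core idea above can be sketched in a few dozen lines: a Transformer trajectory generator conditioned on a latent plan z, a return predictor defined on z, and a test-time step that infers z from a target return before any action is executed. This is a minimal illustration in PyTorch, not the authors' implementation; the class and function names (`LatentPlanTransformer`, `infer_plan`), the architecture sizes, and the use of simple gradient-based inference over z are all assumptions made for the sake of a self-contained example.

```python
import torch
import torch.nn as nn


class LatentPlanTransformer(nn.Module):
    """Hypothetical minimal sketch: a latent plan z conditions a
    Transformer trajectory generator, and a small head predicts the
    trajectory's final return from z alone."""

    def __init__(self, state_dim, act_dim, latent_dim=16, d_model=64):
        super().__init__()
        self.z_proj = nn.Linear(latent_dim, d_model)          # plan token
        self.embed = nn.Linear(state_dim + act_dim, d_model)  # step token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        self.act_head = nn.Linear(d_model, act_dim)  # next-action prediction
        self.ret_head = nn.Linear(latent_dim, 1)     # return predicted from z

    def forward(self, z, states, actions):
        # Prepend the projected latent plan as an extra token, so every
        # generated step is conditioned on the same trajectory abstraction.
        x = self.embed(torch.cat([states, actions], dim=-1))
        ztok = self.z_proj(z).unsqueeze(1)
        h = self.trunk(torch.cat([ztok, x], dim=1))
        return self.act_head(h[:, 1:])  # drop the plan token's output

    def predict_return(self, z):
        return self.ret_head(z)


def infer_plan(model, target_return, latent_dim, steps=50, lr=0.1):
    """Planning as inference (sketch): optimize z so that the predicted
    return matches the desired return, before executing the policy."""
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = (model.predict_return(z) - target_return).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```

In this toy form the latent is inferred by gradient descent against the return head; the paper instead performs posterior sampling of the latent during learning and infers it from the expected return at test time, but the division of labor is the same: the latent carries the plan, and the Transformer decodes it into actions.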

Deqian Kong, Dehong Xu, Minglu Zhao, Bo Pang, Jianwen Xie, Andrew Lizarraga, Yuhao Huang, Sirui Xie, Ying Nian Wu • 2024

Related benchmarks

Task                           | Dataset                      | Result                          | Rank
Offline Reinforcement Learning | D4RL Walker2d Medium v2      | Normalized Return: 77.8         | 67
Offline Reinforcement Learning | Kitchen Partial              | --                              | 62
Offline Reinforcement Learning | D4RL Hopper-medium-replay v2 | Normalized Return: 71.2         | 54
Offline Reinforcement Learning | hopper medium                | --                              | 52
Offline Reinforcement Learning | walker2d medium              | --                              | 51
Offline Reinforcement Learning | walker2d medium-replay       | --                              | 50
Offline Reinforcement Learning | hopper medium-replay         | --                              | 44
Offline Reinforcement Learning | D4RL Hopper Medium v2        | Normalized Return: 58.5         | 43
Offline Reinforcement Learning | D4RL HalfCheetah Medium v2   | Average Normalized Return: 43.1 | 43
Offline Reinforcement Learning | halfcheetah medium           | --                              | 43

Showing 10 of 24 rows

Other info

Code
