
Latent Plan Transformer for Trajectory Abstraction: Planning as Latent Space Inference

About

In tasks aiming for long-term returns, planning becomes essential. We study generative modeling for planning with datasets repurposed from offline reinforcement learning. Specifically, we identify temporal consistency in the absence of step-wise rewards as one key technical challenge. We introduce the Latent Plan Transformer (LPT), a novel model that leverages a latent variable to connect a Transformer-based trajectory generator and the final return. LPT can be learned with maximum likelihood estimation on trajectory-return pairs. In learning, posterior sampling of the latent variable naturally integrates sub-trajectories to form a consistent abstraction despite the finite context. At test time, the latent variable is inferred from an expected return before policy execution, realizing the idea of planning as inference. Our experiments demonstrate that LPT can discover improved decisions from sub-optimal trajectories, achieving competitive performance across several benchmarks, including Gym-Mujoco, Franka Kitchen, Maze2D, and Connect Four. It exhibits capabilities in nuanced credit assignment, trajectory stitching, and adaptation to environmental contingencies. These results validate that latent variable inference can be a strong alternative to step-wise reward prompting.
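The core idea above can be sketched in a few dozen lines: a Transformer trajectory generator conditioned on a latent plan z, a return predictor defined on z, and a test-time step that infers z from a target return before any action is executed. This is a minimal illustration in PyTorch, not the authors' implementation; the class and function names (`LatentPlanTransformer`, `infer_plan`), the architecture sizes, and the use of simple gradient-based inference over z are all assumptions made for the sake of a self-contained example.

```python
import torch
import torch.nn as nn


class LatentPlanTransformer(nn.Module):
    """Hypothetical minimal sketch: a latent plan z conditions a
    Transformer trajectory generator, and a small head predicts the
    trajectory's final return from z alone."""

    def __init__(self, state_dim, act_dim, latent_dim=16, d_model=64):
        super().__init__()
        self.z_proj = nn.Linear(latent_dim, d_model)          # plan token
        self.embed = nn.Linear(state_dim + act_dim, d_model)  # step token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        self.act_head = nn.Linear(d_model, act_dim)  # next-action prediction
        self.ret_head = nn.Linear(latent_dim, 1)     # return predicted from z

    def forward(self, z, states, actions):
        # Prepend the projected latent plan as an extra token, so every
        # generated step is conditioned on the same trajectory abstraction.
        x = self.embed(torch.cat([states, actions], dim=-1))
        ztok = self.z_proj(z).unsqueeze(1)
        h = self.trunk(torch.cat([ztok, x], dim=1))
        return self.act_head(h[:, 1:])  # drop the plan token's output

    def predict_return(self, z):
        return self.ret_head(z)


def infer_plan(model, target_return, latent_dim, steps=50, lr=0.1):
    """Planning as inference (sketch): optimize z so that the predicted
    return matches the desired return, before executing the policy."""
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = (model.predict_return(z) - target_return).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```

In this toy form the latent is inferred by gradient descent against the return head; the paper instead performs posterior sampling of the latent during learning and infers it from the expected return at test time, but the division of labor is the same: the latent carries the plan, and the Transformer decodes it into actions.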

Deqian Kong, Dehong Xu, Minglu Zhao, Bo Pang, Jianwen Xie, Andrew Lizarraga, Yuhao Huang, Sirui Xie, Ying Nian Wu • 2024

Related benchmarks

Task                           | Dataset                      | Result                          | Rank
Offline Reinforcement Learning | D4RL Walker2d Medium v2      | Normalized Return: 77.8         | 67
Offline Reinforcement Learning | Kitchen Partial              | --                              | 62
Offline Reinforcement Learning | D4RL Hopper-medium-replay v2 | Normalized Return: 71.2         | 54
Offline Reinforcement Learning | hopper medium                | --                              | 52
Offline Reinforcement Learning | walker2d medium              | --                              | 51
Offline Reinforcement Learning | walker2d medium-replay       | --                              | 50
Offline Reinforcement Learning | hopper medium-replay         | --                              | 44
Offline Reinforcement Learning | D4RL Hopper Medium v2        | Normalized Return: 58.5         | 43
Offline Reinforcement Learning | D4RL HalfCheetah Medium v2   | Average Normalized Return: 43.1 | 43
Offline Reinforcement Learning | halfcheetah medium           | --                              | 43

Showing 10 of 24 rows

Other info

Code
