
Reasoning with Latent Diffusion in Offline Reinforcement Learning

About

Offline reinforcement learning (RL) holds promise as a means to learn high-reward policies from a static dataset, without the need for further environment interactions. However, a key challenge in offline RL lies in effectively stitching together portions of suboptimal trajectories from the static dataset while avoiding extrapolation errors that arise from a lack of support in the dataset. Existing approaches either use conservative methods that are tricky to tune and, as we show, struggle with multi-modal data, or rely on noisy Monte Carlo return-to-go samples for reward conditioning. In this work, we propose a novel approach that leverages the expressiveness of latent diffusion to model in-support trajectory sequences as compressed latent skills. This facilitates learning a Q-function while avoiding extrapolation error via batch-constraining. The latent space is also expressive and gracefully copes with multi-modal data. We show that the learned temporally-abstract latent space encodes richer task-specific information for offline RL tasks than raw state-actions. This improves credit assignment and facilitates faster reward propagation during Q-learning. Our method demonstrates state-of-the-art performance on the D4RL benchmarks, particularly excelling in long-horizon, sparse-reward tasks.
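The two ideas the abstract combines — temporally-abstract skills and batch-constrained Q-learning — can be illustrated without the diffusion model itself. The sketch below is a hypothetical toy, not the paper's implementation: a "skill" is a fixed-length chunk of primitive actions observed in a static dataset (standing in for a decoded latent), and the Q-backup maximizes only over skills that the dataset supports at each state, which is what prevents extrapolation error. The chain MDP, skill set, and horizon are all invented for illustration.

```python
from collections import defaultdict

# Toy chain MDP: states 0..4, primitive actions +1/-1, reward on reaching state 4.
def step(s, a):
    s2 = max(0, min(4, s + a))
    return s2, (1.0 if s2 == 4 else 0.0)

H = 2  # skill horizon: each skill is a chunk of H primitive actions
dataset_skills = [(+1, +1), (+1, -1), (-1, +1)]  # hypothetical in-support skills

def rollout(s, skill):
    """Execute a skill from state s; return (next state, discounted chunk reward)."""
    total, g = 0.0, 1.0
    for a in skill:
        s, r = step(s, a)
        total += g * r
        g *= 0.99
        if s == 4:  # terminal: stop accumulating reward
            break
    return s, total

# Enumerate (s, z, r, s') transitions only for skills in the dataset.
# Batch-constraining: Q is never queried on out-of-support skills.
support = defaultdict(set)
transitions = []
for s in range(5):
    for z in dataset_skills:
        s2, r = rollout(s, z)
        support[s].add(z)
        transitions.append((s, z, r, s2))

Q = defaultdict(float)
gamma = 0.99 ** H  # discount over one skill = H primitive steps
for _ in range(200):  # value-iteration sweeps over the batch
    for s, z, r, s2 in transitions:
        backup = 0.0 if s2 == 4 else max(Q[(s2, z2)] for z2 in support[s2])
        Q[(s, z)] = r + gamma * backup

# Greedy in-support skill at the start state
best = max(support[0], key=lambda z: Q[(0, z)])
```

Because the backup's `max` ranges only over `support[s2]`, the learned Q-function never evaluates action sequences absent from the dataset — the tabular analogue of the batch-constraining the abstract describes, with the H-step chunks playing the role of the compressed latent skills.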

Siddarth Venkatraman, Shivesh Khaitan, Ravi Tej Akella, John Dolan, Jeff Schneider, Glen Berseth • 2023

Related benchmarks

Task | Dataset | Result | Rank
Locomotion | D4RL walker2d-medium-expert | Normalized Score: 109.3 | 63
Locomotion | D4RL halfcheetah-medium-replay | Normalized Score: 0.418 | 61
Locomotion | D4RL halfcheetah-medium | Normalized Score: 42.8 | 60
Locomotion | D4RL walker2d-medium | Normalized Score: 69.4 | 60
Locomotion | D4RL halfcheetah-medium-expert | Normalized Score: 90.2 | 53
Offline Reinforcement Learning | D4RL antmaze-large-diverse | Normalized Score: 57.7 | 37
Locomotion | D4RL hopper-medium | Normalized Score: 66.2 | 30
Offline Reinforcement Learning | D4RL kitchen-partial | Normalized Performance: 67.8 | 19
Locomotion | D4RL hopper-medium-expert | -- | 18
Offline Reinforcement Learning | D4RL maze2d-umaze | Normalized Performance Score: 134.2 | 12

(Showing 10 of 16 rows)
