
Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning

About

In offline reinforcement learning, a policy is learned from a static dataset, without costly feedback from the environment. In contrast to the online setting, using only static datasets poses additional challenges, such as policies generating out-of-distribution samples. Model-based offline reinforcement learning methods try to overcome these by learning a model of the underlying dynamics of the environment and using it to guide policy search. This is beneficial, but with limited datasets, errors in the model and the issue of value overestimation for out-of-distribution states can worsen performance. Current model-based methods apply some notion of conservatism to the Bellman update, often implemented using uncertainty estimation derived from model ensembles. In this paper, we propose Constrained Latent Action Policies (C-LAP), which learn a generative model of the joint distribution of observations and actions. We cast policy learning as a constrained objective that always stays within the support of the latent action distribution, and use the generative capabilities of the model to impose an implicit constraint on the generated actions. This eliminates the need for additional uncertainty penalties on the Bellman update and significantly decreases the number of gradient steps required to learn a policy. We empirically evaluate C-LAP on the D4RL and V-D4RL benchmarks and show that it is competitive with state-of-the-art methods, especially outperforming them on datasets with visual observations.
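The core idea described above can be illustrated with a minimal toy sketch: the policy acts in the latent space of a learned generative model, the latent proposal is squashed to stay within a bounded support, and the model's action decoder maps it back to an environment action. Everything here (the linear decoder, the tanh bound, the dimensions) is a hypothetical stand-in for illustration only, not the paper's actual architecture or constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

obs_dim, act_dim, latent_dim = 4, 2, 3

# Stand-in for a trained action decoder p(a | z, s): a fixed linear map.
W_dec = rng.normal(size=(latent_dim + obs_dim, act_dim))

def decode_action(z, s):
    """Decode a latent action z (given state s) into an environment action."""
    return np.concatenate([z, s]) @ W_dec

# Stand-in policy head: outputs an unconstrained latent proposal.
W_pi = rng.normal(size=(obs_dim, latent_dim))

def policy(s, bound=1.0):
    """Propose a latent action and constrain it to a bounded support.

    tanh keeps z inside [-bound, bound]; the actual method's constraint
    on the support of the latent action distribution is more involved.
    """
    z_raw = s @ W_pi
    z = bound * np.tanh(z_raw)   # implicit constraint on the latent action
    return decode_action(z, s), z

s = rng.normal(size=obs_dim)
a, z = policy(s)
assert np.all(np.abs(z) <= 1.0)  # latent stays within the support bound
```

Because every action is produced by decoding a constrained latent, the policy can only emit actions the generative model can produce, which is the implicit action constraint the abstract refers to.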

Marvin Alles, Philip Becker-Ehmck, Patrick van der Smagt, Maximilian Karl • 2024

Related benchmarks

Task | Dataset | Result | Rank
Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score 96.8 | 117
Offline Reinforcement Learning | walker2d medium | Normalized Score 71.1 | 51
Offline Reinforcement Learning | walker2d medium-replay | Normalized Score 81.2 | 50
Offline Reinforcement Learning | D4RL walker2d medium-replay | Normalized Score 86 | 45
Offline Reinforcement Learning | halfcheetah medium-replay | Normalized Score 65 | 43
Navigation | D4RL antmaze-medium-play | Normalized Score 77.5 | 22
Offline Reinforcement Learning | D4RL Walker2d expert | Mean Normalized Score 111.7 | 22
Navigation | D4RL antmaze-medium-diverse | Normalized Score 45 | 22
Offline Reinforcement Learning | D4RL HalfCheetah Med-Replay | Normalized Avg Return 55.5 | 20
Offline Reinforcement Learning | D4RL Walker2d medium | Normalized Avg Return 82.5 | 18
(10 of 31 rows shown)

Other info

Code
