
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

About

Effective offline RL methods require properly handling out-of-distribution actions. Implicit Q-learning (IQL) addresses this by training a Q-function using only dataset actions through a modified Bellman backup. However, it is unclear which policy actually attains the values represented by this implicitly trained Q-function. In this paper, we reinterpret IQL as an actor-critic method by generalizing the critic objective and connecting it to a behavior-regularized implicit actor. This generalization shows how the induced actor balances reward maximization and divergence from the behavior policy, with the specific loss choice determining the nature of this tradeoff. Notably, this actor can exhibit complex and multimodal characteristics, suggesting issues with the conditional Gaussian actor fit with advantage weighted regression (AWR) used in prior methods. Instead, we propose drawing samples from a diffusion-parameterized behavior policy and using weights computed from the critic to importance-sample our intended policy. We introduce Implicit Diffusion Q-learning (IDQL), combining our general IQL critic with this policy extraction method. IDQL maintains the ease of implementation of IQL while outperforming prior offline RL methods and demonstrating robustness to hyperparameters. Code is available at https://github.com/philippe-eecs/IDQL.
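The extraction step described in the abstract (sample candidate actions from the behavior policy, then reweight them with the critic) can be illustrated with a small sketch. It rests on several assumptions not stated in the abstract: the critic is trained with the expectile loss from IQL with a hypothetical tau of 0.7, the resampling weights are proportional to |tau - 1(Q < V)|, and the candidate actions are passed in as a plain array rather than drawn from a diffusion model. All function names are hypothetical and do not come from the linked repository.

```python
import numpy as np


def expectile_loss(q, v, tau=0.7):
    """Asymmetric squared loss |tau - 1(u < 0)| * u**2 with u = q - v.

    Assumption: this is the expectile loss used by IQL; tau=0.7 is an
    illustrative value, not one taken from the abstract.
    """
    u = q - v
    weight = np.where(u < 0, 1.0 - tau, tau)
    return weight * u ** 2


def extraction_weights(q, v, tau=0.7):
    """Normalized resampling weights over candidate actions.

    Assumption: weights are proportional to |tau - 1(q < v)|, so
    candidates whose Q-value is at least V are upweighted.
    """
    w = np.where(q - v < 0, 1.0 - tau, tau)
    return w / w.sum()


def select_action(candidates, q, v, tau=0.7, rng=None):
    """Resample one candidate action according to the critic weights.

    `candidates` stands in for samples from a behavior policy; in the
    method described above these would come from a diffusion model.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = extraction_weights(q, v, tau)
    idx = rng.choice(len(candidates), p=probs)
    return candidates[idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    candidates = rng.normal(size=(8, 2))   # 8 hypothetical 2-D actions
    q_values = rng.normal(size=8)          # hypothetical critic outputs
    v_value = float(q_values.mean())       # hypothetical state value
    print(select_action(candidates, q_values, v_value, rng=rng))
```

Because the weights depend only on critic outputs for sampled actions, the behavior model and the critic can be trained separately and combined only at action-selection time, which is one reading of the "ease of implementation" claim.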

Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, Sergey Levine • 2023

Related benchmarks

| Task | Dataset | Normalized Score | Rank |
|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | 94.4 | 169 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | 105.3 | 161 |
| Offline Reinforcement Learning | D4RL walker2d-medium-expert | 111.6 | 132 |
| Offline Reinforcement Learning | D4RL Medium-Replay Hopper | 82.4 | 109 |
| Offline Reinforcement Learning | D4RL Medium HalfCheetah | 49.7 | 105 |
| Offline Reinforcement Learning | D4RL Medium Walker2d | 80.2 | 104 |
| Offline Reinforcement Learning | D4RL Medium-Replay HalfCheetah | 45.1 | 97 |
| Locomotion | D4RL walker2d-medium-expert | 110.6 | 90 |
| walker2d locomotion | D4RL walker2d medium-replay | 89.1 | 78 |
| Offline Reinforcement Learning | D4RL antmaze-umaze (diverse) | 80.2 | 74 |
Showing 10 of 133 rows