
Conservative Q-Learning for Offline Reinforcement Learning

About

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
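The abstract notes that CQL augments the standard Bellman error with a simple Q-value regularizer. As a rough illustration only, the sketch below shows one plausible form of this idea for discrete actions: a log-sum-exp term pushes Q-values down on all actions while the dataset actions' Q-values are pushed back up, added to a squared Bellman error. The function name, signature, and the trade-off weight `alpha` are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def cql_loss(q_values, data_actions, q_target, alpha=1.0):
    """Illustrative sketch of a conservative Q-learning objective
    (discrete actions); names and signature are hypothetical.

    q_values:     (batch, n_actions) Q(s, .) from the learned network
    data_actions: (batch,) integer actions taken in the dataset
    q_target:     (batch,) Bellman targets, e.g. r + gamma * max_a' Q'(s', a')
    alpha:        weight on the conservative regularizer
    """
    batch = np.arange(len(data_actions))
    q_data = q_values[batch, data_actions]  # Q(s, a) at dataset actions

    # Numerically stable log-sum-exp over actions: an upper-bound-like
    # "soft max" of Q(s, .) that the regularizer pushes down.
    m = q_values.max(axis=1, keepdims=True)
    logsumexp = m[:, 0] + np.log(np.exp(q_values - m).sum(axis=1))

    # Conservative term: minimize Q on all actions (via log-sum-exp)
    # while maximizing Q on actions actually observed in the data.
    conservative = (logsumexp - q_data).mean()

    # Standard squared Bellman error on dataset transitions.
    bellman = ((q_data - q_target) ** 2).mean()

    return alpha * conservative + bellman
```

Because log-sum-exp over actions is always at least the dataset action's Q-value, the conservative term is non-negative, so the learned Q-function is pressured toward underestimating out-of-distribution actions, which is the intuition behind the lower-bound claim in the abstract.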

Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine • 2020

Related benchmarks

Task | Dataset | Result | Rank
Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score: 95 | 155
Offline Reinforcement Learning | D4RL hopper-medium-expert | Normalized Score: 111.9 | 153
Offline Reinforcement Learning | D4RL walker2d-medium-expert | Normalized Score: 109.4 | 124
Reinforcement Learning | Hopper v5 | Average Return: 1.78e+3 | 101
Offline Reinforcement Learning | D4RL Medium-Replay Hopper | Normalized Score: 95 | 97
Offline Reinforcement Learning | D4RL Medium HalfCheetah | Normalized Score: 46.9 | 97
Offline Reinforcement Learning | D4RL Medium Walker2d | Normalized Score: 79.5 | 96
Offline Reinforcement Learning | D4RL walker2d-random | Normalized Score: 270 | 93
Auto-bidding | AuctionNet | Score: 363.2 | 90
Offline Reinforcement Learning | D4RL halfcheetah-random | Normalized Score: 35.4 | 86
Showing 10 of 761 rows
