
Mildly Conservative Q-Learning for Offline Reinforcement Learning

About

Offline reinforcement learning (RL) is the task of learning from a static logged dataset without further interaction with the environment. The distribution shift between the learned policy and the behavior policy makes it necessary for the value function to stay conservative, so that out-of-distribution (OOD) actions are not severely overestimated. However, existing approaches, whether penalizing unseen actions or regularizing toward the behavior policy, tend to be overly pessimistic, which suppresses the generalization of the value function and hinders performance improvement. This paper explores mild yet sufficient conservatism for offline learning without harming generalization. We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values. We theoretically show that MCQ induces a policy that behaves at least as well as the behavior policy, and that no erroneous overestimation occurs for OOD actions. Experimental results on the D4RL benchmarks demonstrate that MCQ achieves remarkable performance compared with prior work. Furthermore, MCQ shows superior generalization ability when transferring from offline to online learning, significantly outperforming baselines. Our code is publicly available at https://github.com/dmksjfl/MCQ.
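The core idea, assigning OOD actions a pseudo target so they are never valued above the best in-support action, can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's implementation: the Q-function, the behavior-policy sampler, and all names here are invented stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Q-function over a 1-D state/action space (a hypothetical stand-in
# for a learned critic, not the paper's network).
def q_value(state, action):
    return -((action - 0.3 * state) ** 2)

# Hypothetical behavior policy: Gaussian actions around 0.3 * state.
def behavior_sampler(state, n):
    return 0.3 * state + 0.1 * rng.standard_normal(n)

def pseudo_q_target(state, sampler, n_samples=10):
    """Mildly conservative pseudo target for an OOD action: the max Q
    over actions sampled from the (estimated) behavior policy, so an
    OOD action is trained toward an in-support value rather than an
    extrapolated, possibly overestimated one."""
    actions = sampler(state, n_samples)
    return max(q_value(state, a) for a in actions)

s = 1.0
target = pseudo_q_target(s, behavior_sampler)
ood_action = 2.0  # far outside the behavior support
# The pseudo target never exceeds the best in-support value, yet is
# well above the (poor) extrapolated value of the OOD action.
assert target >= q_value(s, ood_action)
```

The sketch only illustrates the pessimism trade-off: the pseudo value stays anchored to the behavior policy's support (mild conservatism) instead of being pushed arbitrarily low (excessive pessimism).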

Jiafei Lyu, Xiaoteng Ma, Xiu Li, Zongqing Lu • 2022

Related benchmarks

Task                           | Dataset                                                    | Metric                      | Result | Rank
Offline Reinforcement Learning | D4RL antmaze-umaze (diverse)                               | Normalized Score            | 80     | 40
Offline Reinforcement Learning | D4RL AntMaze                                               | AntMaze Umaze Return        | 98.3   | 39
Offline Reinforcement Learning | D4RL MuJoCo Hopper medium standard                         | Normalized Score            | 78.4   | 36
Offline Reinforcement Learning | D4RL Locomotion medium, medium-replay, medium-expert v2    | Score (HalfCheetah, Medium) | 60.98  | 34
Offline Reinforcement Learning | Hopper D4RL v2 (offline)                                   | Average Score               | 76.3   | 32
Offline Reinforcement Learning | Walker2d D4RL v2 (offline)                                 | Return                      | 69.4   | 32
Offline Reinforcement Learning | D4RL Adroit pen (human)                                    | Normalized Return           | 68.5   | 32
Offline Reinforcement Learning | Halfcheetah D4RL v2 (offline)                              | Average Score               | 32.6   | 32
Offline Reinforcement Learning | D4RL Adroit pen (cloned)                                   | Normalized Return           | 49.4   | 32
Offline Reinforcement Learning | D4RL Adroit (expert, human)                                | Adroit Door Return (Human)  | 2.3    | 29

Showing 10 of 40 rows
