
Extreme Q-Learning: MaxEnt RL without Entropy

About

Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from economics. By doing so, we avoid computing Q-values using out-of-distribution actions, which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value function (LogSumExp) in the maximum entropy RL setting without needing to sample from a policy. Using EVT, we derive our Extreme Q-Learning framework and, consequently, online and (for the first time) offline MaxEnt Q-learning algorithms that do not explicitly require access to a policy or its entropy. Our method obtains consistently strong performance on the D4RL benchmark, outperforming prior works by 10+ points on the challenging Franka Kitchen tasks while offering moderate improvements over SAC and TD3 on online DM Control tasks. Visualizations and code can be found on our website at https://div99.github.io/XQL/.

Divyansh Garg, Joey Hejna, Matthieu Geist, Stefano Ermon• 2023
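The abstract's key idea, estimating the soft (LogSumExp) value without sampling actions from a policy, can be illustrated with a small NumPy sketch. The loss below is the exponential "Gumbel regression" form motivated by Extreme Value Theory; function and variable names here are illustrative, not the authors' exact API, and this is a toy one-state check rather than the full XQL training loop:

```python
import numpy as np

def gumbel_loss(q_values, v, beta=1.0):
    """Extreme-value (Gumbel) regression loss.

    Its minimizer over v satisfies E[exp((q - v)/beta)] = 1, i.e.
    v* = beta * log E[exp(q/beta)] -- the soft value -- so the soft
    maximum is recovered without sampling actions from a policy.
    """
    z = (q_values - v) / beta
    return np.mean(np.exp(z) - z - 1.0)

# Toy check: for Q-value samples at a single state, the loss-minimizing
# scalar v matches the LogSumExp (soft-max) of the samples.
rng = np.random.default_rng(0)
q = rng.normal(size=1000)          # stand-in Q(s, a) samples
beta = 1.0

vs = np.linspace(-2.0, 3.0, 501)   # grid search over candidate values
losses = [gumbel_loss(q, v, beta) for v in vs]
v_star = vs[int(np.argmin(losses))]

soft_value = beta * np.log(np.mean(np.exp(q / beta)))
```

Setting the derivative of the loss to zero gives `exp(-v/beta) * E[exp(q/beta)] = 1`, so `v_star` converges to the soft value, which always upper-bounds the plain mean of the Q samples.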

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL Gym walker2d (medium-replay) | Normalized Return | 82.2 | 68 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah (medium) | Normalized Return | 48.3 | 60 |
| Offline Reinforcement Learning | D4RL Gym walker2d (medium) | Normalized Return | 84.2 | 58 |
| Offline Reinforcement Learning | D4RL antmaze-umaze (diverse) | Normalized Score | 77.2 | 47 |
| Offline Reinforcement Learning | D4RL Gym hopper (medium-replay) | Normalized Return | 100.7 | 44 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah (medium-replay) | Normalized Average Return | 45.2 | 43 |
| Offline Reinforcement Learning | D4RL Gym hopper (medium) | Normalized Return | 74.2 | 41 |
| Offline Reinforcement Learning | D4RL Adroit pen (human) | Normalized Return | 105.3 | 39 |
| Offline Reinforcement Learning | D4RL Adroit pen (cloned) | Normalized Return | 112.6 | 39 |
| Offline Reinforcement Learning | D4RL Gym walker2d (medium-expert) | Normalized Average Return | 112.7 | 38 |

Showing 10 of 19 rows.
