
Extreme Q-Learning: MaxEnt RL without Entropy

About

Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from economics. By doing so, we avoid computing Q-values using out-of-distribution actions, which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value functions (LogSumExp) in the maximum entropy RL setting without needing to sample from a policy. Using EVT, we derive our Extreme Q-Learning framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms that do not explicitly require access to a policy or its entropy. Our method obtains consistently strong performance in the D4RL benchmark, outperforming prior works by 10+ points on the challenging Franka Kitchen tasks while offering moderate improvements over SAC and TD3 on online DM Control tasks. Visualizations and code can be found on our website at https://div99.github.io/XQL/.
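To make the LogSumExp idea concrete, here is a minimal NumPy sketch. It assumes the Gumbel-regression form of the objective, J(v) = E[exp((q - v)/beta) - (q - v)/beta - 1], which the abstract does not spell out. Setting the derivative in v to zero gives v = beta * log E[exp(q/beta)], so minimizing J over sampled Q-values recovers the soft value without sampling from or evaluating a policy. The function names, the sample Q-values, and the temperature `beta` are hypothetical, and the scalar Newton solver is for illustration only; it is not the authors' training procedure.

```python
import numpy as np


def soft_value(q, beta):
    """Closed-form soft value: beta * log(mean(exp(q / beta))), computed stably."""
    z = np.asarray(q, dtype=float) / beta
    m = z.max()
    return beta * (m + np.log(np.mean(np.exp(z - m))))


def gumbel_loss(v, q, beta):
    """Assumed objective: mean(exp(z) - z - 1) with z = (q - v) / beta."""
    z = (np.asarray(q, dtype=float) - v) / beta
    return np.mean(np.exp(z) - z - 1.0)


def fit_value_gumbel(q, beta, steps=50):
    """Minimize the convex loss over a scalar v with Newton's method.

    Starting from mean(q), which is never above the minimizer (Jensen's
    inequality), the iterates increase monotonically toward it.
    """
    q = np.asarray(q, dtype=float)
    v = q.mean()
    for _ in range(steps):
        m = np.mean(np.exp((q - v) / beta))  # equals 1 at the optimum
        v -= beta * (1.0 - m) / m            # Newton step: f'(v) / f''(v)
    return v


# Hypothetical Q-values for the actions seen in a batch at one state.
rng = np.random.default_rng(0)
q_samples = rng.normal(size=1000)
beta = 1.0

v_fit = fit_value_gumbel(q_samples, beta)
v_closed = soft_value(q_samples, beta)
print("fitted v:", v_fit)
print("closed-form soft value:", v_closed)
print("loss at fitted v:", gumbel_loss(v_fit, q_samples, beta))
```

The soft value always lies between the mean and the maximum of the sampled Q-values, and it moves toward the maximum as `beta` shrinks, which is why this quantity can stand in for the maximal Q-value.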

Divyansh Garg, Joey Hejna, Matthieu Geist, Stefano Ermon • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL antmaze-umaze (diverse) | Normalized Score | 77.2 | 74 |
| Offline Reinforcement Learning | D4RL Gym walker2d (medium-replay) | Normalized Return | 82.2 | 73 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah-medium | Normalized Return | 48.3 | 65 |
| Offline Reinforcement Learning | D4RL Gym walker2d medium | Normalized Return | 84.2 | 63 |
| Offline Reinforcement Learning | D4RL Adroit pen (human) | Normalized Return | 105.3 | 53 |
| Offline Reinforcement Learning | D4RL Adroit pen (cloned) | Normalized Return | 112.6 | 53 |
| Offline Reinforcement Learning | D4RL Gym hopper (medium-replay) | Normalized Return | 100.7 | 49 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah-medium-replay | Normalized Average Return | 45.2 | 48 |
| Offline Reinforcement Learning | D4RL Gym hopper-medium | Normalized Return | 74.2 | 46 |
| Offline Reinforcement Learning | D4RL Gym walker2d medium-expert | Normalized Average Return | 112.7 | 43 |
Showing 10 of 34 rows
