
Stabilizing Extreme Q-learning by Maclaurin Expansion

About

In offline reinforcement learning, in-sample learning methods are widely used to prevent the performance degradation caused by evaluating actions outside the dataset's distribution. Extreme Q-learning (XQL) employs a loss function based on the assumption that the Bellman error follows a Gumbel distribution, which enables it to model the soft optimal value function in an in-sample manner, and it has demonstrated strong performance in both offline and online reinforcement learning. However, issues remain: the exponential term in the loss function makes training unstable, and the actual error distribution may deviate from the assumed Gumbel distribution. We therefore propose Maclaurin Expanded Extreme Q-learning to enhance stability. Applying a Maclaurin expansion to XQL's loss function improves stability against large errors. The order of the expansion interpolates the modeled value function between the value function under the behavior policy and the soft optimal value function, yielding a trade-off between stability and optimality; equivalently, it adjusts the assumed error distribution between a normal distribution and a Gumbel distribution. Our method significantly stabilizes learning on online RL tasks from DM Control, where XQL was previously unstable, and improves performance on several offline RL tasks from D4RL.
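To illustrate the stability/optimality trade-off described above, here is a minimal NumPy sketch. It assumes XQL's Gumbel-regression loss has the linex form exp(z/β) − z/β − 1 on the error z, and truncates its Maclaurin series at a chosen order. The function name, signature, and exact normalization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def truncated_gumbel_loss(z, beta=1.0, order=2):
    """Maclaurin-truncated Gumbel-regression loss (illustrative).

    Expands exp(x) - x - 1 with x = z / beta as sum_{n=2}^{order} x^n / n!.
    order=2 reduces to a quadratic loss (normal-error assumption);
    as order grows, the loss approaches XQL's exponential
    (Gumbel-error) loss, at the cost of sensitivity to large errors.
    """
    x = np.asarray(z, dtype=float) / beta
    loss = np.zeros_like(x)
    factorial = 1.0
    for n in range(2, order + 1):
        factorial *= n          # running n!
        loss += x**n / factorial
    return loss
```

For an error of z = 2 with β = 1, order 2 gives the quadratic value 2²/2 = 2.0, while increasing the order converges toward the full exponential loss exp(2) − 2 − 1 ≈ 4.39, showing how the expansion order controls how aggressively large errors are penalized.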

Motoki Omura, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL Gym walker2d (medium-replay) | Normalized Return | 83.6 | 68 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah-medium | Normalized Return | 47.7 | 60 |
| Offline Reinforcement Learning | D4RL Gym walker2d medium | Normalized Return | 83.8 | 58 |
| Offline Reinforcement Learning | D4RL antmaze-umaze (diverse) | Normalized Score | 53.2 | 47 |
| Offline Reinforcement Learning | D4RL Gym hopper (medium-replay) | Normalized Return | 102.7 | 44 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah-medium-replay | Normalized Average Return | 45.7 | 43 |
| Offline Reinforcement Learning | D4RL Gym hopper-medium | Normalized Return | 80.9 | 41 |
| Offline Reinforcement Learning | D4RL Adroit pen (cloned) | Normalized Return | 117.4 | 39 |
| Offline Reinforcement Learning | D4RL Adroit pen (human) | Normalized Return | 122.1 | 39 |
| Offline Reinforcement Learning | D4RL Gym walker2d medium-expert | Normalized Average Return | 111.2 | 38 |

Showing 10 of 17 rows.
