
Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization

About

Most offline reinforcement learning (RL) methods suffer from the trade-off between improving the policy to surpass the behavior policy and constraining the policy to limit the deviation from the behavior policy as computing $Q$-values using out-of-distribution (OOD) actions will suffer from errors due to distributional shift. The recently proposed In-sample Learning paradigm (i.e., IQL), which improves the policy by quantile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles the distributional shift in learning the value function. In this work, we make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works, i.e., it applies implicit value regularization to the policy. Based on the IVR framework, we further propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works, but in a complete in-sample manner. Compared with IQL, we find that our algorithms introduce sparsity in learning the value function, making them more robust in noisy data regimes. We also verify the effectiveness of SQL and EQL on D4RL benchmark datasets and show the benefits of in-sample learning by comparing them with CQL in small data regimes.
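To make the in-sample idea in the abstract concrete, here is a minimal NumPy sketch of two value losses that only ever use $Q$-values at actions present in the dataset. Nothing in it comes from this page: the function names, the hyperparameters `tau` and `alpha`, and the toy batch are all hypothetical. The first loss is the asymmetric squared (expectile) loss usually associated with IQL, which is an assumption here because the abstract only says "quantile regression". The second is a sparsity-inducing form believed to resemble the SQL objective; it should be checked against the paper before being relied on.

```python
import numpy as np


def expectile_value_loss(q, v, tau=0.7):
    """Asymmetric squared loss on u = q - v (assumed IQL-style objective).

    q: Q(s, a) evaluated only at dataset actions; v: V(s) for the same states.
    """
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    u = q - v
    # Positive residuals are weighted by tau, negative ones by (1 - tau).
    weight = np.where(u > 0, tau, 1.0 - tau)
    return float(np.mean(weight * u ** 2))


def sparse_value_loss(q, v, alpha=1.0):
    """Hypothetical sparse value loss (assumed form, not taken from this page).

    Samples with q - v <= -2 * alpha fall outside the indicator, so their
    squared term is switched off and contributes nothing to the loss.
    """
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    term = 1.0 + (q - v) / (2.0 * alpha)
    active = (term > 0).astype(float)
    return float(np.mean(active * term ** 2 + v / alpha))


# Toy batch: every (state, action) pair is assumed to come from the dataset,
# so no out-of-distribution action is ever queried.
q_batch = np.array([2.0, 0.0, -5.0])
v_batch = np.array([0.0, 1.0, 0.0])

iql_style = expectile_value_loss(q_batch, v_batch, tau=0.7)
sparse_style = sparse_value_loss(q_batch, v_batch, alpha=1.0)
```

In this sketch the third sample has a residual well below `-2 * alpha`, so its squared term is masked out in `sparse_value_loss`, whereas `expectile_value_loss` still penalizes it with weight `1 - tau`. That masking is one way to read the abstract's claim that the proposed algorithms "introduce sparsity in learning the value function".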

Haoran Xu, Li Jiang, Jianxiong Li, Zhuoran Yang, Zhaoran Wang, Victor Wai Kin Chan, Xianyuan Zhan • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Offline Reinforcement Learning | antmaze medium-play | Score | 80.2 | 35 |
| Offline Reinforcement Learning | D4RL Locomotion medium, medium-replay, medium-expert v2 | Score (HalfCheetah, Medium) | 48.3 | 34 |
| Offline Reinforcement Learning | MuJoCo hopper D4RL (medium-replay) | Normalized Return | 99.7 | 26 |
| Offline Reinforcement Learning | MuJoCo walker2d-medium D4RL | Normalized Return | 84.2 | 20 |
| Offline Reinforcement Learning | MuJoCo walker2d medium-replay D4RL | Normalized Return | 81.2 | 20 |
| Offline Reinforcement Learning | MuJoCo halfcheetah-medium-replay D4RL | Normalized Return | 44.8 | 20 |
| Offline Reinforcement Learning | MuJoCo halfcheetah-medium D4RL | Normalized Return | 48.3 | 20 |
| Offline Reinforcement Learning | D4RL Kitchen kitchen-partial v0 (test) | Normalized Score | 74.5 | 18 |
| Offline Reinforcement Learning | MuJoCo halfcheetah-medium-expert D4RL | Normalized Return | 94 | 18 |
| Offline Reinforcement Learning | MuJoCo walker2d medium-expert D4RL | Normalized Return | 110 | 18 |
Showing 10 of 39 rows
