
OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation

About

We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy-gradients, unlike previous offline RL algorithms. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods.
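To make the central object concrete: the DICE family of methods estimates stationary distribution corrections w(s, a) = d^π(s, a) / d^D(s, a), the ratio between the target policy's stationary state-action distribution and the data distribution. Once such corrections are available, the policy's average reward can be estimated from dataset samples alone, with no further environment interaction. The sketch below illustrates only this weighted-average identity, with hypothetical data and assumed-known corrections; it is not the authors' algorithm, which must itself estimate the corrections for the optimal policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline dataset: per-sample rewards r(s, a).
rewards = rng.uniform(0.0, 1.0, size=1000)

# Assumed-known corrections w(s, a) = d^pi(s, a) / d^D(s, a).
# All-ones corresponds to evaluating the behavior policy itself.
corrections = np.ones_like(rewards)

# Policy value as a correction-weighted average over dataset samples:
#   rho(pi) = E_{(s, a) ~ d^D}[ w(s, a) * r(s, a) ]
policy_value = np.mean(corrections * rewards)
```

With nontrivial corrections, samples that the target policy would visit more often than the behavior policy did are up-weighted, and vice versa; estimating those weights well is the hard part that OptiDICE addresses.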

Jongmin Lee, Wonseok Jeon, Byung-Jun Lee, Joelle Pineau, Kee-Eung Kim• 2021

Related benchmarks

Task | Dataset | Normalized Score | Rank
Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | 91.1 | 117
Offline Reinforcement Learning | D4RL hopper-medium-expert | 111.5 | 115
Offline Reinforcement Learning | D4RL walker2d-medium-expert | 74.8 | 86
Offline Reinforcement Learning | D4RL walker2d-random | 9.9 | 77
Offline Reinforcement Learning | D4RL halfcheetah-random | 11.6 | 70
Offline Reinforcement Learning | D4RL hopper-random | 11.2 | 62
Offline Reinforcement Learning | hopper medium | 94.1 | 52
Offline Reinforcement Learning | walker2d medium | 21.8 | 51
Offline Reinforcement Learning | walker2d medium-replay | 21.6 | 50
Offline Reinforcement Learning | hopper medium-replay | 36.4 | 44
Showing 10 of 75 rows
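The normalized scores above follow the D4RL convention, which rescales a raw environment return so that 0 corresponds to a random policy and 100 to an expert policy. A minimal sketch of that normalization (the reference scores used here are made-up illustrations, not the official D4RL constants):

```python
def normalized_score(raw_score: float, random_score: float, expert_score: float) -> float:
    """D4RL-style normalization: 0 = random-policy return, 100 = expert return."""
    return 100.0 * (raw_score - random_score) / (expert_score - random_score)

# Hypothetical reference scores, for illustration only.
print(normalized_score(4000.0, random_score=0.0, expert_score=5000.0))  # 80.0
```

Scores above 100, such as the 111.5 on hopper-medium-expert, indicate a return exceeding the expert reference policy's.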
