OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation

About

We consider the offline reinforcement learning (RL) setting, where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, distributional shift becomes the primary source of difficulty; it arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy gradients, unlike previous offline RL algorithms. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods.
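To make the term "stationary distribution correction" concrete, the sketch below computes it exactly on a tiny tabular MDP. This is not OptiDICE itself, which estimates the correction for the optimal policy from data alone. The two-state MDP, both policies, and all names here are hypothetical, and the sketch assumes the normalized discounted occupancy as the notion of stationary distribution. It only shows that reweighting dataset samples by the ratio w(s, a) = d_pi(s, a) / d_data(s, a) recovers the target policy's average reward.

```python
# Toy illustration (hypothetical MDP): stationary distribution corrections.
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)
P0 = [1.0, 0.0]                      # initial state distribution
# T[s][a][s2]: probability of moving to s2 after taking a in s
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 0.0], [1.0, 0.5]]         # R[s][a]: reward


def occupancy(policy, iters=500):
    """Normalized discounted occupancy d(s, a), found by fixed-point iteration."""
    d = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(iters):
        new = [[0.0, 0.0], [0.0, 0.0]]
        for s2 in STATES:
            inflow = sum(d[s][a] * T[s][a][s2] for s in STATES for a in ACTIONS)
            mass = (1 - GAMMA) * P0[s2] + GAMMA * inflow
            for a2 in ACTIONS:
                new[s2][a2] = mass * policy[s2][a2]
        d = new
    return d


behavior = [[0.5, 0.5], [0.5, 0.5]]  # policy that "collected the data"
target = [[0.1, 0.9], [0.8, 0.2]]    # policy we want to evaluate

d_data = occupancy(behavior)
d_pi = occupancy(target)

# Correction ratio: how much more (or less) often the target policy
# visits each (s, a) than the data distribution does.
w = [[d_pi[s][a] / d_data[s][a] for a in ACTIONS] for s in STATES]

pairs = [(s, a) for s in STATES for a in ACTIONS]
direct = sum(d_pi[s][a] * R[s][a] for s, a in pairs)
reweighted = sum(d_data[s][a] * w[s][a] * R[s][a] for s, a in pairs)
print(direct, reweighted)            # the two estimates coincide
```

Because the ratio is defined over state-action pairs the dataset actually covers, an estimate built this way never queries values for actions outside the data, which is the intuition behind avoiding overestimation.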

Jongmin Lee, Wonseok Jeon, Byung-Jun Lee, Joelle Pineau, Kee-Eung Kim • 2021

Related benchmarks

Task	Dataset	Result
Offline Reinforcement Learning	D4RL halfcheetah-medium-expert	Normalized Score 91.1	169
Offline Reinforcement Learning	D4RL hopper-medium-expert	Normalized Score 111.5	161
Offline Reinforcement Learning	D4RL walker2d-medium-expert	Normalized Score 74.8	132
Offline Reinforcement Learning	D4RL walker2d-random	Normalized Score 9.9	101
Offline Reinforcement Learning	D4RL halfcheetah-random	Normalized Score 11.6	94
Offline Reinforcement Learning	D4RL hopper-random	Normalized Score 11.2	86
Offline Reinforcement Learning	hopper medium	Normalized Score 94.1	68
Offline Reinforcement Learning	walker2d medium-replay	Normalized Score 21.6	61
Offline Reinforcement Learning	walker2d medium	Normalized Score 21.8	61
Offline Reinforcement Learning	hopper medium-replay	Normalized Score 36.4	55

Showing 10 of 75 rows
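For readers unfamiliar with the "Normalized Score" metric above: D4RL conventionally rescales a raw episode return so that a random policy maps to 0 and an expert policy maps to 100. A minimal sketch of that rescaling follows. The function name and the reference returns in the usage lines are hypothetical placeholders, not the actual D4RL reference values for any environment.

```python
def normalized_score(raw, random_ref, expert_ref):
    """Rescale a raw return so random_ref maps to 0 and expert_ref maps to 100."""
    return 100.0 * (raw - random_ref) / (expert_ref - random_ref)


# Hypothetical reference returns, for illustration only.
RANDOM_REF = 0.0
EXPERT_REF = 3200.0

print(normalized_score(1600.0, RANDOM_REF, EXPERT_REF))  # 50.0
print(normalized_score(3520.0, RANDOM_REF, EXPERT_REF))  # above 100: better than the expert reference
```

Scores above 100, such as the hopper-medium-expert entry in the table, therefore mean the learned policy's return exceeded the expert reference return.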
