
ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update

About

In this study, we investigate DIstribution Correction Estimation (DICE) methods, an important line of work in offline reinforcement learning (RL) and imitation learning (IL). DICE-based methods impose a state-action-level behavior constraint, which is an ideal choice for offline learning. However, they typically perform much worse than current state-of-the-art (SOTA) methods that use only an action-level behavior constraint. After revisiting DICE-based methods, we find that two gradient terms arise when learning the value function with a true-gradient update: the forward gradient (taken on the current state) and the backward gradient (taken on the next state). Using the forward gradient bears a large similarity to many offline RL methods, and thus can be regarded as applying an action-level constraint. However, directly adding the backward gradient may degrade or cancel out this effect when the two gradients have conflicting directions. To resolve this issue, we propose a simple yet effective modification that projects the backward gradient onto the normal plane of the forward gradient, resulting in an orthogonal-gradient update, a new learning rule for DICE-based methods. We conduct thorough theoretical analyses and find that the projected backward gradient brings state-level behavior regularization, which reveals the mystery of DICE-based methods: the value learning objective does try to impose a state-action-level constraint, but needs to be used in a corrected way. Through toy examples and extensive experiments on complex offline RL and IL tasks, we demonstrate that DICE-based methods using orthogonal-gradient updates (O-DICE) achieve SOTA performance and strong robustness.
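The core modification described above, projecting the backward gradient onto the normal plane of the forward gradient before combining the two, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `g_forward`, `g_backward`, and the weight `eta` are assumptions for exposition, and in practice these would be gradients of the value objective with respect to the network parameters.

```python
import numpy as np

def orthogonal_gradient_update(g_forward, g_backward, eta=1.0, eps=1e-12):
    """Sketch of an orthogonal-gradient update (O-DICE-style).

    g_forward : gradient term taken on the current state
    g_backward: gradient term taken on the next state
    eta       : weight on the projected backward component (illustrative knob)
    """
    g_forward = np.asarray(g_forward, dtype=float)
    g_backward = np.asarray(g_backward, dtype=float)
    # Remove the component of g_backward that lies along g_forward,
    # so the backward term can no longer cancel the forward term.
    coeff = (g_forward @ g_backward) / (g_forward @ g_forward + eps)
    g_backward_perp = g_backward - coeff * g_forward
    # The projected backward gradient is orthogonal to the forward one;
    # the combined direction keeps the full forward (action-level) signal.
    return g_forward + eta * g_backward_perp
```

For example, with `g_forward = [1, 0]` and `g_backward = [1, 1]`, the two gradients partially conflict along the first axis; the projection keeps only `[0, 1]` from the backward term, so the combined update is `[1, 1]` rather than the naive sum `[2, 1]`.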

Liyuan Mao, Haoran Xu, Weinan Zhang, Xianyuan Zhan • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Offline Reinforcement Learning | hopper medium | Normalized Score: 86.1 | 52 |
| Offline Reinforcement Learning | walker2d medium | Normalized Score: 84.9 | 51 |
| Offline Reinforcement Learning | walker2d medium-replay | Normalized Score: 83.6 | 50 |
| Offline Reinforcement Learning | hopper medium-replay | Normalized Score: 99.9 | 44 |
| Offline Reinforcement Learning | halfcheetah medium | Normalized Score: 47.4 | 43 |
| Offline Reinforcement Learning | halfcheetah medium-replay | Normalized Score: 44.0 | 43 |
| Offline Reinforcement Learning | walker2d medium-expert | Normalized Score: 110.8 | 31 |
| Offline Reinforcement Learning | hopper medium-expert | Normalized Score: 110.8 | 24 |
| Offline Reinforcement Learning | halfcheetah medium-expert | Normalized Score: 93.2 | 15 |
| Offline Reinforcement Learning | D4RL v2 | Score (HalfCheetah-M): 47.4 | 9 |
