Offline Reinforcement Learning with Fisher Divergence Critic Regularization
About
Many modern approaches to offline Reinforcement Learning (RL) use behavior regularization, typically augmenting a model-free actor-critic algorithm with a penalty that measures the divergence of the learned policy from the offline data. In this work, we propose an alternative way to keep the learned policy close to the data: parameterizing the critic as the log of the behavior policy that generated the offline data, plus a state-action value offset term learned by a neural network. Behavior regularization then corresponds to an appropriate regularizer on the offset term. We propose a gradient penalty regularizer for the offset term and show it is equivalent to Fisher divergence regularization, suggesting connections to the score matching and energy-based generative model literature. We thus term the resulting algorithm Fisher-BRC (Behavior Regularized Critic). On standard offline RL benchmarks, Fisher-BRC achieves both improved performance and faster convergence over existing state-of-the-art methods.
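The critic parameterization described above can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's implementation: `log_mu`, `offset`, `grad_penalty`, and the weight `lam` are illustrative names, the offset is a simple bilinear stand-in for a neural network, and the action-gradient is estimated by finite differences rather than autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))  # toy parameters of the offset term (stand-in for a network)

def log_mu(s, a):
    # Stand-in for the learned behavior policy's log-density log mu(a|s)
    # (e.g. a Gaussian fitted to the offline data).
    return -0.5 * float(np.sum((a - 0.1 * s[:2]) ** 2))

def offset(s, a, W):
    # O(s, a): the learned state-action value offset (toy bilinear model here).
    return float(s @ W @ a)

def critic(s, a, W):
    # Fisher-BRC critic: Q(s, a) = log mu(a|s) + O(s, a).
    return log_mu(s, a) + offset(s, a, W)

def grad_penalty(s, a, W, eps=1e-5):
    # ||grad_a O(s, a)||^2 via central finite differences; penalizing this
    # squared action-gradient norm is the gradient-penalty regularizer that
    # the paper shows is equivalent to Fisher divergence regularization.
    g = np.zeros_like(a)
    for i in range(a.size):
        d = np.zeros_like(a)
        d[i] = eps
        g[i] = (offset(s, a + d, W) - offset(s, a - d, W)) / (2 * eps)
    return float(g @ g)

s, a = rng.normal(size=4), rng.normal(size=2)
lam = 0.1  # regularization weight (assumed hyperparameter)
q = critic(s, a, W)
penalty = lam * grad_penalty(s, a, W)
```

In training, `penalty` would be added to the critic's TD loss, so the offset term (and hence the policy that maximizes the critic) is discouraged from drifting far from the behavior policy's support.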
Related benchmarks
| Task | Dataset | Normalized Score | Rank |
|---|---|---|---|
| Offline Reinforcement Learning | D4RL walker2d-random | 60 | 77 |
| Offline Reinforcement Learning | D4RL halfcheetah-random | 32.2 | 70 |
| Offline Reinforcement Learning | D4RL hopper-random | 11.4 | 62 |
| hopper locomotion | D4RL hopper-medium-replay | 94.7 | 56 |
| walker2d locomotion | D4RL walker2d-medium-replay | 73.8 | 53 |
| Locomotion | D4RL walker2d-medium-expert | 109.6 | 47 |
| Locomotion | D4RL halfcheetah-medium | 47.4 | 44 |
| Locomotion | D4RL walker2d-medium | 0.783 | 44 |
| hopper locomotion | D4RL hopper-medium | 66.2 | 38 |
| hopper locomotion | D4RL hopper-medium-expert | 91.5 | 38 |