Offline Reinforcement Learning with Fisher Divergence Critic Regularization
About
Many modern approaches to offline Reinforcement Learning (RL) use behavior regularization, typically augmenting a model-free actor-critic algorithm with a penalty that measures the divergence of the learned policy from the offline data. In this work, we propose an alternative way to keep the learned policy close to the data: parameterizing the critic as the log of the behavior policy that generated the offline data, plus a state-action value offset term learned by a neural network. Behavior regularization then corresponds to an appropriate regularizer on the offset term. We propose a gradient penalty regularizer for the offset term and show it is equivalent to Fisher divergence regularization, suggesting connections to the score matching and energy-based generative model literature. We thus term the resulting algorithm Fisher-BRC (Behavior Regularized Critic). On standard offline RL benchmarks, Fisher-BRC achieves both improved performance and faster convergence over existing state-of-the-art methods.
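The critic parameterization described above can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's implementation: `log_mu`, `offset`, `grad_penalty`, and the weight `lam` are illustrative names, the offset is a simple bilinear stand-in for a neural network, and the action-gradient is estimated by finite differences rather than autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))  # toy parameters of the offset term (stand-in for a network)

def log_mu(s, a):
    # Stand-in for the learned behavior policy's log-density log mu(a|s)
    # (e.g. a Gaussian fitted to the offline data).
    return -0.5 * float(np.sum((a - 0.1 * s[:2]) ** 2))

def offset(s, a, W):
    # O(s, a): the learned state-action value offset (toy bilinear model here).
    return float(s @ W @ a)

def critic(s, a, W):
    # Fisher-BRC critic: Q(s, a) = log mu(a|s) + O(s, a).
    return log_mu(s, a) + offset(s, a, W)

def grad_penalty(s, a, W, eps=1e-5):
    # ||grad_a O(s, a)||^2 via central finite differences; penalizing this
    # squared action-gradient norm is the gradient-penalty regularizer that
    # the paper shows is equivalent to Fisher divergence regularization.
    g = np.zeros_like(a)
    for i in range(a.size):
        d = np.zeros_like(a)
        d[i] = eps
        g[i] = (offset(s, a + d, W) - offset(s, a - d, W)) / (2 * eps)
    return float(g @ g)

s, a = rng.normal(size=4), rng.normal(size=2)
lam = 0.1  # regularization weight (assumed hyperparameter)
q = critic(s, a, W)
penalty = lam * grad_penalty(s, a, W)
```

In training, `penalty` would be added to the critic's TD loss, so the offset term (and hence the policy that maximizes the critic) is discouraged from drifting far from the behavior policy's support.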
Related benchmarks
| Task | Dataset | Normalized Score | Rank |
|---|---|---|---|
| Offline Reinforcement Learning | D4RL walker2d-random | 60 | 77 |
| Offline Reinforcement Learning | D4RL halfcheetah-random | 32.2 | 70 |
| Offline Reinforcement Learning | D4RL hopper-random | 11.4 | 62 |
| hopper locomotion | D4RL hopper-medium-replay | 94.7 | 56 |
| walker2d locomotion | D4RL walker2d-medium-replay | 73.8 | 53 |
| Locomotion | D4RL walker2d-medium-expert | 109.6 | 47 |
| Locomotion | D4RL halfcheetah-medium | 47.4 | 44 |
| Locomotion | D4RL walker2d-medium | 0.783 | 44 |
| hopper locomotion | D4RL hopper-medium | 66.2 | 38 |
| hopper locomotion | D4RL hopper-medium-expert | 91.5 | 38 |