Adversarially Trained Actor Critic for Offline Reinforcement Learning

About

We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a two-player Stackelberg game: a policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by data with appropriately chosen hyperparameters. Notably, compared with existing works, our framework offers both theoretical guarantees for general function approximation and a deep RL implementation that scales to complex environments and large datasets. In the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.
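To make the two-player structure concrete, here is a toy, brute-force sketch of relative pessimism in a tabular setting. It assumes a specific form for the game that the abstract does not spell out: the critic (the follower) minimizes the average of f(s, pi(s)) - f(s, a) over the logged data plus beta times a squared Bellman error, and the actor (the leader) picks the policy that scores highest under the critic's best response. The data, candidate critics, and helper names (make_policy, relative_pessimism, bellman_error, critic_loss, stackelberg_select) are hypothetical, and enumerating small finite sets stands in for the function approximation and no-regret actor updates that ATAC actually relies on.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

# Toy tabular setting (an assumption for illustration, not the paper's deep RL setup).
Transition = Tuple[int, int, float, int]  # (state, logged action, reward, next state)
Critic = Dict[Tuple[int, int], float]     # f(s, a) stored as a table
Policy = Callable[[int], int]             # deterministic policy s -> a


def make_policy(actions: Sequence[int]) -> Policy:
    """Hypothetical helper: deterministic policy that picks actions[s] in state s."""
    return lambda s: actions[s]


def relative_pessimism(f: Critic, pi: Policy, data: List[Transition]) -> float:
    """Mean of f(s, pi(s)) - f(s, a) over the data: how much better pi looks
    than the logged (behavior) action according to critic f."""
    return sum(f[(s, pi(s))] - f[(s, a)] for s, a, _, _ in data) / len(data)


def bellman_error(f: Critic, pi: Policy, data: List[Transition], gamma: float) -> float:
    """Mean squared TD error of f for policy pi; small values mean f is data-consistent."""
    total = 0.0
    for s, a, r, s_next in data:
        target = r + gamma * f[(s_next, pi(s_next))]
        total += (f[(s, a)] - target) ** 2
    return total / len(data)


def critic_loss(f: Critic, pi: Policy, data: List[Transition], gamma: float, beta: float) -> float:
    """Follower objective (minimized): make pi look worse than the behavior
    policy while staying data-consistent; beta weights the Bellman term."""
    return relative_pessimism(f, pi, data) + beta * bellman_error(f, pi, data, gamma)


def stackelberg_select(policies, critics, data, gamma, beta):
    """Leader (actor) picks the policy with the highest value under the
    follower's (critic's) best response, by brute force over finite sets."""
    best = None
    for pi in policies:
        f_pi = min(critics, key=lambda f: critic_loss(f, pi, data, gamma, beta))
        value = relative_pessimism(f_pi, pi, data)
        if best is None or value > best[0]:
            best = (value, pi, f_pi)
    return best


# Usage on a two-state, two-action toy problem with made-up numbers.
data = [(0, 0, 1.0, 1), (1, 1, 0.0, 0)]  # behavior policy: action 0 in state 0, action 1 in state 1
policies = [make_policy(acts) for acts in product([0, 1], repeat=2)]
critics = [
    {(s, a): v for (s, a), v in zip(product([0, 1], repeat=2), values)}
    for values in [(1.0, 2.0, 0.5, 0.0), (0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.5)]
]
value, pi_hat, f_hat = stackelberg_select(policies, critics, data, gamma=0.5, beta=2.0)
print(value, [pi_hat(s) for s in (0, 1)])  # prints: 0.0 [0, 1]
```

In this instance the third critic makes every deviation from the logged actions look worse while keeping its Bellman error small, so the actor falls back to the behavior policy. Because the relative term is exactly zero for the behavior policy under any critic, the selected game value cannot be negative whenever that policy is among the candidates, a toy analogue of the guarantee that ATAC does not underperform the behavior policy.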

Ching-An Cheng, Tengyang Xie, Nan Jiang, Alekh Agarwal • 2022

Related benchmarks

Task	Dataset	Metric	Result
Offline Reinforcement Learning	D4RL halfcheetah-medium-expert	Normalized Score	94.8	169
Offline Reinforcement Learning	D4RL hopper-medium-expert	Normalized Score	119.2	161
Offline Reinforcement Learning	D4RL walker2d-medium-expert	Normalized Score	114.2	132
Offline Reinforcement Learning	D4RL hopper-medium-replay	Normalized Score	102.5	109
Offline Reinforcement Learning	D4RL halfcheetah-medium	Normalized Score	53.3	105
Offline Reinforcement Learning	D4RL walker2d-medium	Normalized Score	89.6	104
Offline Reinforcement Learning	D4RL walker2d-random	Normalized Score	6.8	101
Offline Reinforcement Learning	D4RL halfcheetah-medium-replay	Normalized Score	48	97
Offline Reinforcement Learning	D4RL halfcheetah-random	Normalized Score	3.9	94
Offline Reinforcement Learning	D4RL hopper-random	Normalized Score	17.5	86

Showing 10 of 22 rows
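The Result column reports D4RL normalized scores. D4RL rescales a policy's raw return linearly so that 0 corresponds to a random policy and 100 to an expert policy on the same task; a score above 100, such as 119.2 on hopper-medium-expert, therefore exceeds the expert reference return. A minimal sketch of that conversion follows; the function name is hypothetical and the reference returns are made-up placeholders, not D4RL's official per-task values.

```python
def d4rl_normalized_score(raw_return: float, random_return: float, expert_return: float) -> float:
    """Rescale a raw episode return so that random = 0 and expert = 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, for illustration only.
RANDOM_RETURN = 0.0
EXPERT_RETURN = 2500.0

print(d4rl_normalized_score(3000.0, RANDOM_RETURN, EXPERT_RETURN))  # prints: 120.0
```

Because the rescaling is per task, normalized scores can be compared across environments even though their raw returns live on very different scales.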
