Adversarially Trained Actor Critic for Offline Reinforcement Learning
About
We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a two-player Stackelberg game: a policy actor competes against an adversarially trained value critic, which finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by the data when the hyperparameters are chosen appropriately. Notably, compared with existing work, our framework offers both theoretical guarantees for general function approximation and a deep RL implementation that scales to complex environments and large datasets. On the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.
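
The relative-pessimism game described above can be illustrated on a toy tabular problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the finite set of random candidate critics, the squared TD error used as the data-consistency penalty, the weight `beta`, and all function and variable names are hypothetical choices made for illustration. For each candidate policy, the critic picks the Q-table that makes the policy look worst relative to the logged actions while staying roughly Bellman-consistent, and the actor then picks the policy with the best worst-case relative value.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_samples = 4, 3, 50

# Hypothetical logged dataset from a deterministic behavior policy.
behavior = np.array([0, 1, 2, 1])
s = rng.integers(0, n_states, size=n_samples)
a = behavior[s]
r = rng.normal(size=n_samples)
s_next = rng.integers(0, n_states, size=n_samples)

# Hypothetical finite critic class: random Q-tables of shape (state, action).
critics = [rng.normal(size=(n_states, n_actions)) for _ in range(20)]


def relative_pessimistic_value(policy, beta=1.0, gamma=0.99):
    """Relative value of `policy` under its adversarially chosen critic.

    The critic minimizes (relative value) + beta * (squared TD error);
    the squared TD error is a simplified stand-in for a Bellman-consistency term.
    """
    best_score, best_gap = None, None
    for f in critics:
        # How much better the policy looks than the logged actions under f.
        gap = np.mean(f[s, policy[s]] - f[s, a])
        td_error = f[s, a] - (r + gamma * f[s_next, policy[s_next]])
        score = gap + beta * np.mean(td_error ** 2)
        if best_score is None or score < best_score:
            best_score, best_gap = score, gap
    return best_gap


# Actor step: enumerate all deterministic policies and keep the best one.
values = {
    p: relative_pessimistic_value(np.array(p))
    for p in itertools.product(range(n_actions), repeat=n_states)
}
best_policy = max(values, key=values.get)
print("behavior policy value:", values[tuple(behavior)])
print("best policy:", best_policy, "value:", values[best_policy])
```

In this sketch the behavior policy always receives a relative value of exactly zero, whatever critic is chosen, because its actions coincide with the logged ones. The actor's maximizer therefore never scores below zero under its own adversarial critic, which mirrors the intuition behind the claim that the learned policy does not underperform the behavior policy across a range of pessimism levels.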
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score | 94.8 | 117 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | Normalized Score | 119.2 | 115 |
| Offline Reinforcement Learning | D4RL walker2d-medium-expert | Normalized Score | 114.2 | 86 |
| Offline Reinforcement Learning | D4RL walker2d-random | Normalized Score | 6.8 | 77 |
| Offline Reinforcement Learning | D4RL halfcheetah-random | Normalized Score | 3.9 | 70 |
| Offline Reinforcement Learning | D4RL hopper-random | Normalized Score | 17.5 | 62 |
| Offline Reinforcement Learning | hopper medium | Normalized Score | 85.6 | 52 |
| Offline Reinforcement Learning | walker2d medium | Normalized Score | 89.6 | 51 |
| Offline Reinforcement Learning | walker2d medium-replay | Normalized Score | 86.5 | 50 |
| Offline Reinforcement Learning | hopper medium-replay | Normalized Score | 102.5 | 44 |
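
For readers unfamiliar with the metric in the table, D4RL normalized scores rescale a raw episode return so that a random policy maps to 0 and an expert policy maps to 100, which is why values above 100 are possible. The snippet below is a small sketch of that rescaling; the function name and the reference returns in the usage example are hypothetical placeholders, not the benchmark's actual per-environment constants.

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so that random -> 0 and expert -> 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, chosen only to make the arithmetic easy to follow.
random_return, expert_return = 0.0, 200.0
print(d4rl_normalized_score(100.0, random_return, expert_return))  # halfway between random and expert
print(d4rl_normalized_score(240.0, random_return, expert_return))  # better than the expert reference
```

A policy that exceeds the expert reference return scores above 100, which is consistent with entries such as 119.2 in the table.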