Off-Policy Actor-Critic with Shared Experience Replay
About
We investigate the combination of actor-critic reinforcement learning algorithms with uniform large-scale experience replay and propose solutions for two challenges: (a) efficient actor-critic learning with experience replay, and (b) the stability of off-policy learning where agents learn from other agents' behaviour. We employ these insights to accelerate hyper-parameter sweeps in which all participating agents run concurrently and share their experience via a common replay module. To this end, we analyze the bias-variance trade-offs in V-trace, a form of importance sampling for actor-critic methods. Based on our analysis, we argue for mixing experience sampled from replay with on-policy experience, and we propose a new trust-region scheme that scales effectively to data distributions where V-trace becomes unstable. We provide extensive empirical validation of the proposed solution, and we demonstrate state-of-the-art data efficiency on Atari among agents trained on up to 200M environment frames.
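To make the V-trace component of the abstract concrete, the following is a minimal NumPy sketch of the V-trace value targets from the IMPALA formulation that this work analyzes: importance ratios ρ and trace coefficients c are clipped at `rho_bar` and `c_bar`, and the targets are accumulated by a backward recursion over the trajectory. Function and variable names here are illustrative, not taken from the paper's code.

```python
import numpy as np

def vtrace_targets(behaviour_logp, target_logp, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets v_s for a single trajectory.

    behaviour_logp / target_logp: log-probabilities of the taken actions
    under the behaviour policy mu and the current (target) policy pi.
    Clipping rho at rho_bar trades bias for variance: smaller rho_bar
    means lower variance but a value target further from pi's true value.
    """
    rhos = np.exp(np.asarray(target_logp) - np.asarray(behaviour_logp))
    clipped_rhos = np.minimum(rho_bar, rhos)   # rho_t, controls fixed point
    cs = np.minimum(c_bar, rhos)               # c_t, controls trace cutting

    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = len(rewards)
    values_tp1 = np.append(values[1:], bootstrap_value)

    # One-step importance-weighted TD errors.
    deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)

    # Backward recursion: (v_s - V(x_s)) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    acc = 0.0
    vs_minus_v = np.zeros(T)
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v
```

On-policy (pi = mu, so all ratios equal 1 and nothing is clipped), the targets reduce to ordinary n-step returns; as the behaviour policy drifts away from the target policy, clipping increasingly truncates the trace, which is the bias-variance trade-off the paper examines.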
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Reinforcement Learning | ALE Atari 57 games | HWRB | 7 | 16 |
| Reinforcement Learning | Atari 2600 57 games (test) | Median Human-Normalized Score | 431 | 15 |
| Reinforcement Learning | Atari-57 (full) | HWRB | 7 | 13 |
| Atari Game Playing | Atari 57 games, 200M environment frames | Median Human-Normalized Score | 431 | 11 |
| Reinforcement Learning | Atari 57 (ALE), 200M frames, sticky actions | Median Human-Normalized Score | 431 | 9 |
| Reinforcement Learning | Atari 57 Standard | Median Human-Normalized Score | 448 | 5 |
| Reinforcement Learning | Atari small data setting | Median Human-Normalized Score | 431 | 5 |
| Multi-task reinforcement learning | DMLab-30 Multi-task Standard | Mean-Capped Human-Normalized Score | 81.7 | 4 |