
Off-Policy Actor-Critic with Shared Experience Replay

About

We investigate the combination of actor-critic reinforcement learning algorithms with uniform large-scale experience replay and propose solutions for two challenges: (a) efficient actor-critic learning with experience replay, and (b) the stability of off-policy learning when agents learn from other agents' behaviour. We employ these insights to accelerate hyper-parameter sweeps in which all participating agents run concurrently and share their experience via a common replay module. To this end, we analyze the bias-variance trade-offs in V-trace, a form of importance sampling for actor-critic methods. Based on our analysis, we argue for mixing experience sampled from replay with on-policy experience, and propose a new trust region scheme that scales effectively to data distributions where V-trace becomes unstable. We provide extensive empirical validation of the proposed solution, and demonstrate state-of-the-art data efficiency on Atari among agents trained for up to 200M environment frames.
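The V-trace estimator analyzed in the abstract forms value targets from off-policy trajectories by clipping importance-sampling ratios. A minimal NumPy sketch of the target computation is below; function and argument names are illustrative, not from the paper's codebase, and the clipping thresholds `rho_bar` and `c_bar` follow the standard V-trace formulation (the bias-variance trade-off the authors analyze comes from these clips).

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets for a single trajectory (sketch).

    rewards:          r_t for t = 0..T-1, shape [T]
    values:           V(x_t),             shape [T]
    bootstrap_value:  V(x_T) used to bootstrap at the end
    rhos:             importance ratios pi(a_t|x_t) / mu(a_t|x_t), shape [T]
    """
    T = len(rewards)
    clipped_rhos = np.minimum(rho_bar, rhos)  # rho_t = min(rho_bar, pi/mu)
    clipped_cs = np.minimum(c_bar, rhos)      # c_t   = min(c_bar,   pi/mu)
    values_tp1 = np.append(values[1:], bootstrap_value)
    # Clipped temporal-difference terms:
    # delta_t = rho_t * (r_t + gamma * V(x_{t+1}) - V(x_t))
    deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)
    # Backward recursion: a_t = delta_t + gamma * c_t * a_{t+1}
    acc = 0.0
    corrections = np.zeros(T)
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * clipped_cs[t] * acc
        corrections[t] = acc
    # v_t = V(x_t) + accumulated clipped corrections
    return values + corrections
```

When the behaviour and target policies coincide (all ratios equal 1 and no clipping is active), the targets reduce to ordinary discounted n-step returns; the instability the paper addresses arises when replayed data makes the ratios, and hence the clipped estimator's bias, large.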

Simon Schmitt, Matteo Hessel, Karen Simonyan • 2019

Related benchmarks

Task | Dataset | Result | Rank
Reinforcement Learning | ALE Atari 57 games | HWRB: 7 | 16
Reinforcement Learning | Atari 2600 57 games (test) | Median Human-Normalized Score: 431 | 15
Reinforcement Learning | Atari-57 (full) | HWRB: 7 | 13
Atari Game Playing | Atari 57 games, 200M environment frames | Median Human-Normalized Score: 431 | 11
Reinforcement Learning | Atari 57 (ALE), 200M frames, sticky actions | Median Human-Normalized Score: 431 | 9
Reinforcement Learning | Atari 57 Standard | Median Human-Normalized Score: 448 | 5
Reinforcement Learning | Atari small data setting | Median Human-Normalized Score: 431 | 5
Multi-task reinforcement learning | DMLab-30 Multi-task Standard | Mean-Capped Human-Normalized Score: 81.7 | 4
