Muesli: Combining Improvements in Policy Optimization
About
We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero's state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.
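The combination described above can be illustrated with a simplified, single-state sketch. It assumes a discrete action space and a regularizer in the style of the paper's clipped MPO (CMPO) target, where a prior policy is reweighted by exp(clip(advantage, -c, c)) and the network policy is pulled toward that target with a KL term. The function names, the weights `reg_weight` and `model_weight`, and the scalar `model_loss` placeholder are hypothetical. The plain policy-gradient term also omits details such as off-policy corrections and advantage normalization, so this is not the paper's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def clipped_target(prior_probs, advantages, clip=1.0):
    """Target policy proportional to prior(a) * exp(clip(adv(a), -clip, clip))."""
    weights = [p * math.exp(max(-clip, min(clip, a)))
               for p, a in zip(prior_probs, advantages)]
    z = sum(weights)
    return [w / z for w in weights]


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_loss(logits, prior_probs, advantages, action, model_loss,
                  reg_weight=1.0, model_weight=1.0, clip=1.0):
    """Hypothetical combined objective for one state."""
    probs = softmax(logits)
    # Policy-gradient term for the sampled action (advantage treated as a constant).
    pg = -advantages[action] * math.log(probs[action])
    # Regularizer: pull the policy toward the clipped, advantage-reweighted prior.
    target = clipped_target(prior_probs, advantages, clip)
    reg = kl(target, probs)
    # Model learning enters only as an additive auxiliary loss.
    return pg + reg_weight * reg + model_weight * model_loss
```

Clipping the advantage bounds how far the target can move away from the prior in one update, which is the sense in which the policy optimization is "regularized"; the model loss shapes shared representations without being used for search at acting time.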
Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, Hado van Hasselt (2021)
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reinforcement Learning | ALE Atari 57 games | HWRB | 5 | 16 |
| Reinforcement Learning | Atari-57 (full) | HWRB | 5 | 13 |
| Atari Game Playing | Atari 57 games 200M environment frames | Median Human-Normalized Score | 1.04e+3 | 11 |
| Reinforcement Learning | Atari 57 (ALE) 200M frames sticky actions | Median Human-Normalized Score | 1.04e+3 | 9 |
| Reinforcement Learning | Atari 2600 (test) | Alien Score | 1.62e+4 | 5 |
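The Median Human-Normalized Score rows aggregate per-game results using the common Atari normalization, (agent − random) / (human − random), here assumed to be expressed as a percentage, as the 1.04e+3 figure suggests. The sketch below shows that computation; the game names and raw scores in the usage example are made-up toy values, not results from the paper.

```python
from statistics import median


def human_normalized(agent, random_score, human):
    """Human-normalized score in percent: 0 = random play, 100 = human reference."""
    return 100.0 * (agent - random_score) / (human - random_score)


def median_hns(results):
    """results maps a game name to (agent, random, human) raw scores."""
    return median(human_normalized(*scores) for scores in results.values())


# Toy example with hypothetical games and scores.
toy = {
    "game_a": (10.0, 0.0, 10.0),  # matches the human reference: 100%
    "game_b": (30.0, 0.0, 10.0),  # three times the human margin: 300%
    "game_c": (5.0, 0.0, 10.0),   # half the human margin: 50%
}
print(median_hns(toy))
```

The median is used rather than the mean because a few games with very large normalized scores would otherwise dominate the aggregate.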