Muesli: Combining Improvements in Policy Optimization

About

We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero's state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.
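The abstract describes the update only at a high level. The sketch below shows one generic way to combine a regularized policy loss with an auxiliary model loss. It makes several assumptions: a discrete action space, a single state, advantage estimates supplied from elsewhere, and a squared-error reward-prediction term standing in for the model loss. The regularization target π_target(a) ∝ π_prior(a) · exp(clip(Â(a), -c, c)) was chosen for illustration. All function names and weights are hypothetical, and this is not the paper's exact objective.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_target(prior_probs, advantages, clip=1.0):
    """Assumed target: prior reweighted by exp of clipped advantages, renormalized."""
    weights = [
        p * math.exp(max(-clip, min(clip, adv)))
        for p, adv in zip(prior_probs, advantages)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p(a) = 0 contribute nothing."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)


def policy_loss(logits, prior_probs, advantages, action, reg_weight=1.0, clip=1.0):
    """Hypothetical regularized policy loss for a single state."""
    probs = softmax(logits)
    # Policy-gradient term for the sampled action: -A(a) * log pi(a).
    pg_term = -advantages[action] * math.log(probs[action])
    # Regularizer: KL from the clipped-advantage target to the current policy.
    reg_term = kl(clipped_target(prior_probs, advantages, clip), probs)
    return pg_term + reg_weight * reg_term


def model_aux_loss(predicted_rewards, observed_rewards):
    """Hypothetical auxiliary model loss: mean squared reward error over an unroll."""
    n = len(observed_rewards)
    return sum((p - o) ** 2 for p, o in zip(predicted_rewards, observed_rewards)) / n


def total_loss(logits, prior_probs, advantages, action,
               predicted_rewards, observed_rewards, model_weight=1.0):
    """Regularized policy loss plus a weighted auxiliary model loss."""
    return (policy_loss(logits, prior_probs, advantages, action)
            + model_weight * model_aux_loss(predicted_rewards, observed_rewards))
```

With uniform logits, a uniform prior, and zero advantages, the policy term is zero and only the model term remains. In a real agent these quantities would be computed over batches and differentiated with an autodiff framework. The sketch only shows how the terms combine into one objective.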

Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, Hado van Hasselt (2021)

Related benchmarks

Task	Dataset	Metric	Value	Rank
Reinforcement Learning	ALE Atari 57 games	HWRB	5	16
Reinforcement Learning	Atari-57 (full)	HWRB	5	13
Atari Game Playing	Atari 57 games 200M environment frames	Median Human-Normalized Score	1.04e+3	11
Reinforcement Learning	Atari 57 (ALE) 200M frames sticky actions	Median Human-Normalized Score	1.04e+3	9
Reinforcement Learning	Atari 2600 (test)	Alien Score	1.62e+4	5
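The aggregate Atari metrics in the table are built from per-game scores. The sketch below uses the usual definition of the human-normalized score (HNS): 100 · (agent - random) / (human - random), taken per game, with the median then computed over games. It also assumes that HWRB counts the games in which the agent's score exceeds the human world record. All game names and scores are hypothetical toy values, not results from this paper.

```python
from statistics import median


def human_normalized(agent, random, human):
    """Human-normalized score in percent: 0 = random play, 100 = human baseline."""
    return 100.0 * (agent - random) / (human - random)


def median_hns(results):
    """results maps game -> (agent, random, human); returns the median HNS over games."""
    return median([human_normalized(*scores) for scores in results.values()])


def world_record_breakthroughs(agent_scores, world_records):
    """Assumed HWRB definition: number of games where the agent strictly beats the record."""
    return sum(1 for game, score in agent_scores.items() if score > world_records[game])


# Hypothetical toy data: game -> (agent, random, human).
results = {
    "game_a": (150.0, 0.0, 100.0),
    "game_b": (55.0, 10.0, 100.0),
    "game_c": (300.0, 0.0, 100.0),
}
# Hypothetical toy human world records.
records = {"game_a": 200.0, "game_b": 1000.0, "game_c": 250.0}
agent_scores = {game: scores[0] for game, scores in results.items()}

print(median_hns(results))                               # 150.0
print(world_record_breakthroughs(agent_scores, records))  # 1
```

The median is less sensitive than the mean to a few games with extremely large normalized scores, which is one reason it is commonly reported on Atari-57.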

