A Mixture of Surprises for Unsupervised Reinforcement Learning

About

Unsupervised reinforcement learning aims at learning a generalist policy in a reward-free manner for fast adaptation to downstream tasks. Most of the existing methods propose to provide an intrinsic reward based on surprise. Maximizing or minimizing surprise drives the agent to either explore or gain control over its environment. However, both strategies rely on a strong assumption: the entropy of the environment's dynamics is either high or low. This assumption may not always hold in real-world scenarios, where the entropy of the environment's dynamics may be unknown. Hence, choosing between the two objectives is a dilemma. We propose a novel yet simple mixture of policies to address this concern, allowing us to optimize an objective that simultaneously maximizes and minimizes the surprise. Concretely, we train one mixture component whose objective is to maximize the surprise and another whose objective is to minimize the surprise. Hence, our method does not make assumptions about the entropy of the environment's dynamics. We call our method a $\textbf{M}\text{ixture }\textbf{O}\text{f }\textbf{S}\text{urprise}\textbf{S}$ (MOSS) for unsupervised reinforcement learning. Experimental results show that our simple method achieves state-of-the-art performance on the URLB benchmark, outperforming previous pure surprise maximization-based objectives. Our code is available at: https://github.com/LeapLabTHU/MOSS.

Andrew Zhao, Matthieu Gaetan Lin, Yangguang Li, Yong-Jin Liu, Gao Huang• 2022
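
The idea in the abstract, one mixture component that maximizes surprise and another that minimizes it, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `intrinsic_reward`, `sample_mode`, `MAXIMIZE`, `MINIMIZE`, and the parameter `p_max` are hypothetical. The surprise value is assumed to come from some external estimator that the abstract does not specify.

```python
import random

# Hypothetical labels for the two mixture components.
MAXIMIZE = "max"
MINIMIZE = "min"


def intrinsic_reward(surprise: float, mode: str) -> float:
    """Turn a surprise estimate into an intrinsic reward.

    The maximizing component is rewarded for surprise; the minimizing
    component is rewarded for its negation. How `surprise` is estimated
    is left open here (assumption: it is supplied by the caller).
    """
    if mode == MAXIMIZE:
        return surprise
    if mode == MINIMIZE:
        return -surprise
    raise ValueError(f"unknown mode: {mode!r}")


def sample_mode(rng: random.Random, p_max: float = 0.5) -> str:
    """Pick which component acts, e.g. once per episode.

    `p_max` is an assumed mixing probability, not a value from the paper.
    """
    return MAXIMIZE if rng.random() < p_max else MINIMIZE
```

For example, `intrinsic_reward(2.0, MINIMIZE)` yields `-2.0`, so the same surprise signal pushes the two components in opposite directions. Under this sketch, neither objective has to be chosen in advance based on the entropy of the environment's dynamics.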

Related benchmarks

Task	Dataset	Result
Bottom Left	URLB Jaco 1.0 (test)	Mean Score: 151	12
Flip	URLB Walker 1.0 (test)	Mean Score: 729	12
Run	URLB Quadruped 1.0 (test)	Mean Score: 485	12
Stand	URLB Quadruped 1.0 (test)	Mean Score: 911	12
Top Left	URLB Jaco 1.0 (test)	Mean Score: 150	12
Top Right	URLB Jaco 1.0 (test)	Mean Score: 150	12
Unsupervised Reinforcement Learning	URL Benchmark Quadruped	Jump Score: 627	12
Walk	URLB Walker 1.0 (test)	Mean Score: 942	12
Walk	URLB Quadruped 1.0 (test)	Mean Score: 635	12
Jump	URLB Quadruped 1.0 (test)	Mean Score: 674	12

Showing 10 of 17 rows
