Frustratingly Easy Regularization on Representation Can Boost Deep Reinforcement Learning

About

Deep reinforcement learning (DRL) promises that an agent can learn a good policy from high-dimensional information, while representation learning removes irrelevant and redundant information and retains pertinent information. In this work, we demonstrate that the learned representations of the $Q$-network and its target $Q$-network should, in theory, satisfy a favorable distinguishable representation property. Specifically, there exists an upper bound on the representation similarity of the value functions of two adjacent time steps in a typical DRL setting. However, through illustrative experiments, we show that a learned DRL agent may violate this property, leading to a sub-optimal policy. Therefore, we propose a simple yet effective regularizer called Policy Evaluation with Easy Regularization on Representation (PEER), which aims to maintain the distinguishable representation property via explicit regularization on internal representations. We also provide a convergence rate guarantee for PEER. Implementing PEER requires only one line of code. Our experiments demonstrate that incorporating PEER into DRL can significantly improve performance and sample efficiency. Comprehensive experiments show that PEER achieves state-of-the-art performance in all 4 PyBullet environments, on 9 out of 12 DMControl tasks, and on 19 out of 26 Atari games. To the best of our knowledge, PEER is the first work to study the inherent representation property of the $Q$-network and its target. Our code is available at https://sites.google.com/view/peer-cvpr2023/.
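To make the idea of a one-line regularizer on representations concrete, here is a minimal sketch in plain Python. It assumes the penalty is the mean inner product between the $Q$-network's features for the current step and the target network's features for the next step, added to a mean-squared TD loss. The function names, the `beta` coefficient, and its default value are hypothetical and chosen for illustration; the abstract does not specify them, so this should not be read as the authors' implementation.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def peer_style_loss(
    q_values: Sequence[float],
    td_targets: Sequence[float],
    feats: Sequence[Sequence[float]],
    target_feats: Sequence[Sequence[float]],
    beta: float = 0.1,  # hypothetical weight, not taken from the source
) -> float:
    """Mean-squared TD error plus a representation-similarity penalty.

    feats[i]        : assumed Q-network representation for transition i
    target_feats[i] : assumed target-network representation for the next step
    In an autodiff framework the target features would normally be treated
    as constants (no gradient); that detail is outside this sketch.
    """
    n = len(q_values)
    td_loss = sum((q - y) ** 2 for q, y in zip(q_values, td_targets)) / n
    # The "one line": penalize similarity between the two representations.
    similarity = sum(dot(f, g) for f, g in zip(feats, target_feats)) / n
    return td_loss + beta * similarity


loss = peer_style_loss(
    q_values=[1.0, 2.0],
    td_targets=[1.5, 2.0],
    feats=[[1.0, 0.0], [0.0, 1.0]],
    target_feats=[[1.0, 0.0], [1.0, 0.0]],
)
print(loss)
```

Under these assumptions, minimizing the extra term pushes the two representations apart, which is one way to encourage the distinguishable representation property described above.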

Qiang He, Huangyuan Su, Jieyu Zhang, Xinwen Hou • 2022

Related benchmarks

Task	Dataset	Result
Continuous Control	DMControl 500k	Spin Score: 864	42
Continuous Control	DMControl 100k	DMControl: Finger Spin Score: 820	38
Reinforcement Learning	Atari100k (test)	Alien Score: 1.22e+3	23
Continuous Control	HalfCheetah v3	Average Return: 7.46e+3	10
Continuous Control	Walker2d v3	Average Return: 3.61e+3	10
Continuous Control	Ant v3	Average Return: 4.36e+3	10
Reinforcement Learning	Atari 2600 (test)	Alien: 1.22e+3	10
Continuous Control	Mujoco	Hopper Score: 3.42e+3	7
Continuous Control	InvertedPendulum v2	Average Return: 983	7
Continuous Control	MuJoCo Suite Aggregate	Average Normalized Score: 72.8	7

Showing 10 of 19 rows
