Frustratingly Easy Regularization on Representation Can Boost Deep Reinforcement Learning
About
Deep reinforcement learning (DRL) promises that an agent can learn a good policy from high-dimensional information, while representation learning removes irrelevant and redundant information and retains what is pertinent. In this work, we demonstrate that the learned representations of the $Q$-network and its target $Q$-network should, in theory, satisfy a favorable distinguishable representation property. Specifically, there exists an upper bound on the representation similarity of the value functions of two adjacent time steps in a typical DRL setting. However, through illustrative experiments, we show that a learned DRL agent may violate this property, which leads to a sub-optimal policy. We therefore propose a simple yet effective regularizer called Policy Evaluation with Easy Regularization on Representation (PEER), which aims to maintain the distinguishable representation property via explicit regularization on internal representations. We also provide a convergence rate guarantee for PEER. Implementing PEER requires only one line of code. Our experiments demonstrate that incorporating PEER into DRL can significantly improve performance and sample efficiency. Comprehensive experiments show that PEER achieves state-of-the-art performance on all 4 environments in PyBullet, 9 out of 12 tasks in DMControl, and 19 out of 26 games in Atari. To the best of our knowledge, PEER is the first work to study the inherent representation property of the $Q$-network and its target. Our code is available at https://sites.google.com/view/peer-cvpr2023/.
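The abstract does not spell out the exact form of the regularizer, so the following is only a minimal sketch under stated assumptions. It assumes the penalty is the batch-mean inner product between the $Q$-network's internal representation of the current state-action pair and the target network's representation of the next one, added to an ordinary squared TD error. The function names `peer_regularizer` and `peer_loss`, the coefficient `beta`, and its default value are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def peer_regularizer(phi_online, phi_target):
    """Batch-mean inner product of two representation batches.

    phi_online: array of shape (batch, dim), assumed to be the Q-network's
                internal features for (s_t, a_t).
    phi_target: array of shape (batch, dim), assumed to be the target
                network's features for (s_{t+1}, a_{t+1}).
    """
    phi_online = np.asarray(phi_online, dtype=float)
    phi_target = np.asarray(phi_target, dtype=float)
    # Row-wise dot product, then average over the batch.
    return float(np.mean(np.sum(phi_online * phi_target, axis=-1)))


def peer_loss(q_pred, td_target, phi_online, phi_target, beta=1e-3):
    """Squared TD error plus the (assumed) representation penalty.

    beta is a hypothetical weighting coefficient; its default is arbitrary.
    """
    q_pred = np.asarray(q_pred, dtype=float)
    td_target = np.asarray(td_target, dtype=float)
    td_loss = float(np.mean((q_pred - td_target) ** 2))
    return td_loss + beta * peer_regularizer(phi_online, phi_target)


if __name__ == "__main__":
    phi = [[1.0, 0.0], [0.0, 2.0]]
    phi_next = [[1.0, 0.0], [0.0, 1.0]]
    print(peer_loss([1.0, 2.0], [1.0, 4.0], phi, phi_next, beta=0.1))
```

In a gradient-based framework the extra term would be a single added line in the critic loss, which is in keeping with the "one line of code" remark; in practice the target representation would typically be treated as a constant during backpropagation, but that detail is also an assumption here.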
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Continuous Control | DMControl 500k | Spin Score | 864 | 33 |
| Continuous Control | DMControl 100k | DMControl: Finger Spin Score | 820 | 29 |
| Reinforcement Learning | Atari100k (test) | Alien Score | 1.22e+3 | 23 |
| Reinforcement Learning | Atari 2600 (test) | Alien | 1.22e+3 | 10 |
| Continuous Control | Mujoco | Hopper Score | 3.42e+3 | 7 |
| Continuous Control | HalfCheetah v3 | Average Return | 7.46e+3 | 7 |
| Continuous Control | Walker2d v3 | Average Return | 3.61e+3 | 7 |
| Continuous Control | InvertedPendulum v2 | Average Return | 983 | 7 |
| Continuous Control | MuJoCo Suite Aggregate | Average Normalized Score | 72.8 | 7 |
| Continuous Control | Hopper v3 | Average Return | 2.72e+3 | 7 |