SPQR: Controlling Q-ensemble Independence with Spiked Random Model for Reinforcement Learning
About
Alleviating overestimation bias is a critical challenge for deep reinforcement learning to achieve successful performance on complex tasks or offline datasets containing out-of-distribution data. To overcome overestimation bias, ensemble methods for Q-learning have been investigated to exploit the diversity of multiple Q-functions. Network initialization has been the predominant approach to promoting diversity in Q-functions, and heuristically designed diversity-injection methods have also been studied in the literature. However, previous studies have not provided a theoretical guarantee of independence across the ensemble. By introducing a novel regularization loss for Q-ensemble independence based on random matrix theory, we propose spiked Wishart Q-ensemble independence regularization (SPQR) for reinforcement learning. Specifically, we replace the intractable hypothesis-testing criterion for Q-ensemble independence with a tractable KL divergence between the spectral distribution of the Q-ensemble and the target Wigner's semicircle distribution. We implement SPQR in several online and offline ensemble Q-learning algorithms. In the experiments, SPQR outperforms the baseline algorithms on both online and offline RL benchmarks.
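The core idea above can be illustrated with a minimal NumPy sketch: estimate the empirical eigenvalue spectrum of a normalized Gram matrix built from the ensemble members' Q-values, then compute a discretized KL divergence against Wigner's semicircle density. This is an illustrative assumption-laden sketch, not the authors' implementation; the function names, the rescaling of eigenvalues to the semicircle support, and the histogram-based KL are all choices made here for clarity.

```python
import numpy as np


def semicircle_pdf(x, radius=2.0):
    """Wigner's semicircle density with support [-radius, radius]."""
    pdf = np.zeros_like(x, dtype=float)
    inside = np.abs(x) < radius
    pdf[inside] = (2.0 / (np.pi * radius ** 2)) * np.sqrt(radius ** 2 - x[inside] ** 2)
    return pdf


def spectral_kl_to_semicircle(q_values, n_bins=32, radius=2.0, eps=1e-8):
    """Discretized KL(empirical spectrum || semicircle) for a Q-ensemble.

    q_values: (n_members, batch) array of Q-estimates, one row per
    ensemble member evaluated on a shared batch of (s, a) pairs.
    NOTE: this is a hypothetical sketch of the regularizer's shape,
    not the SPQR paper's exact loss.
    """
    n, b = q_values.shape
    # Center and scale each member's outputs so the Gram spectrum is
    # comparable across ensembles with different value magnitudes.
    centered = q_values - q_values.mean(axis=1, keepdims=True)
    centered = centered / (centered.std(axis=1, keepdims=True) + eps)
    gram = centered @ centered.T / b              # (n, n) sample covariance
    eigs = np.linalg.eigvalsh(gram)
    # Standardize eigenvalues onto the semicircle support (a modeling choice).
    eigs = (eigs - eigs.mean()) / (eigs.std() + eps)
    hist, edges = np.histogram(eigs, bins=n_bins, range=(-radius, radius))
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Smooth both histograms with eps and renormalize so KL >= 0 holds.
    p = hist + eps
    p = p / p.sum()
    q = semicircle_pdf(centers, radius) + eps
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

In an actual training loop, a term like `beta * spectral_kl_to_semicircle(q_values)` would be added to the critic loss, pushing the ensemble's spectrum toward the semicircle law that characterizes independent random matrices.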
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | Kitchen Partial | Normalized Score | 54.2 | 62 |
| Offline Reinforcement Learning | D4RL Gym walker2d (medium-replay) | Normalized Return | 89.6 | 52 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah-medium | Normalized Return | 74.8 | 44 |
| Offline Reinforcement Learning | D4RL Gym walker2d medium | Normalized Return | 98.4 | 42 |
| Offline Reinforcement Learning | D4RL AntMaze | AntMaze Umaze Return | 93.3 | 39 |
| Offline Reinforcement Learning | antmaze medium-play | Score | 80 | 35 |
| Offline Reinforcement Learning | D4RL Gym walker2d medium-expert | Normalized Average Return | 118.2 | 31 |
| Offline Reinforcement Learning | D4RL Gym hopper-medium-expert | Normalized Average Return | 112.5 | 29 |
| Offline Reinforcement Learning | kitchen mixed | Normalized Score | 55.6 | 29 |
| Offline Reinforcement Learning | D4RL Gym halfcheetah-medium-expert | Normalized Return | 114 | 28 |