
SPQR: Controlling Q-ensemble Independence with Spiked Random Model for Reinforcement Learning

About

Alleviating overestimation bias is a critical challenge for deep reinforcement learning, particularly on complex tasks or offline datasets containing out-of-distribution data. To mitigate overestimation bias, ensemble methods for Q-learning have been investigated that exploit the diversity of multiple Q-functions. Network initialization has been the predominant approach to promoting diversity in Q-functions, and heuristically designed diversity-injection methods have also been studied in the literature. However, previous studies have not attempted to guarantee the independence of an ensemble from a theoretical perspective. By introducing a novel regularization loss for Q-ensemble independence based on random matrix theory, we propose spiked Wishart Q-ensemble independence regularization (SPQR) for reinforcement learning. Specifically, we replace the intractable hypothesis-testing criterion for Q-ensemble independence with a tractable KL divergence between the spectral distribution of the Q-ensemble and the target Wigner's semicircle distribution. We implement SPQR in several online and offline ensemble Q-learning algorithms. In the experiments, SPQR outperforms the baseline algorithms on both online and offline RL benchmarks.
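The abstract's core idea can be illustrated with a toy spectral penalty. The sketch below is not the paper's implementation: the choice of correlation matrix, the √N scaling, the eigenvalue clipping, and the histogram-based KL estimate are all illustrative assumptions. It builds a symmetric matrix from the cross-correlations of the ensemble members' Q-estimates (which, for independent members, behaves like a Wigner matrix whose spectrum approaches the semicircle law) and measures a KL-style divergence between its empirical eigenvalue histogram and a discretized semicircle target.

```python
import numpy as np

def semicircle_pdf(x, radius=2.0):
    # Wigner semicircle density: 2/(pi*R^2) * sqrt(R^2 - x^2) on [-R, R]
    inside = np.clip(radius**2 - x**2, 0.0, None)
    return np.sqrt(inside) * 2.0 / (np.pi * radius**2)

def spectral_independence_penalty(q_values, n_bins=16, eps=1e-8):
    """Toy sketch of a semicircle-matching penalty (not the SPQR loss itself).

    q_values: (n_ensemble, batch) array of Q-estimates, one row per member.
    Returns a KL-like divergence between the eigenvalue histogram of the
    off-diagonal correlation matrix and a discretized semicircle target.
    """
    n, b = q_values.shape
    centered = q_values - q_values.mean(axis=1, keepdims=True)
    norms = np.sqrt((centered**2).sum(axis=1))
    corr = (centered @ centered.T) / (np.outer(norms, norms) + eps)
    # Zero the diagonal and rescale; for independent members the result
    # resembles a Wigner matrix (the sqrt(n) scaling is illustrative).
    wigner_like = (corr - np.eye(n)) * np.sqrt(n)
    eigs = np.clip(np.linalg.eigvalsh(wigner_like), -2.0, 2.0)
    # Empirical spectral histogram vs. discretized semicircle target.
    hist, edges = np.histogram(eigs, bins=n_bins, range=(-2.0, 2.0))
    p = hist / (hist.sum() + eps)
    centers = (edges[:-1] + edges[1:]) / 2
    q = semicircle_pdf(centers)
    q = q / (q.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

Highly correlated members push eigenvalues toward the clipped extremes, where the semicircle density is near zero, so the penalty is larger than for independent members whose spectrum concentrates where the target has mass.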

Dohyeok Lee, Seungyub Han, Taehyun Cho, Jungwoo Lee • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Offline Reinforcement Learning | Kitchen Partial | Normalized Score | 54.2 | 62
Offline Reinforcement Learning | D4RL Gym walker2d (medium-replay) | Normalized Return | 89.6 | 52
Offline Reinforcement Learning | D4RL Gym halfcheetah-medium | Normalized Return | 74.8 | 44
Offline Reinforcement Learning | D4RL Gym walker2d medium | Normalized Return | 98.4 | 42
Offline Reinforcement Learning | D4RL AntMaze | AntMaze Umaze Return | 93.3 | 39
Offline Reinforcement Learning | antmaze medium-play | Score | 80 | 35
Offline Reinforcement Learning | D4RL Gym walker2d medium-expert | Normalized Average Return | 118.2 | 31
Offline Reinforcement Learning | D4RL Gym hopper-medium-expert | Normalized Avg Return | 112.5 | 29
Offline Reinforcement Learning | kitchen mixed | Normalized Score | 55.6 | 29
Offline Reinforcement Learning | D4RL Gym halfcheetah-medium-expert | Normalized Return | 114 | 28
(showing 10 of 27 rows)

Other info

Code
