
Randomized Ensembled Double Q-Learning: Learning Fast Without a Model

About

Using a high Update-To-Data (UTD) ratio, model-based methods have recently achieved much higher sample efficiency than previous model-free methods for continuous-action DRL benchmarks. In this paper, we introduce a simple model-free algorithm, Randomized Ensembled Double Q-Learning (REDQ), and show that its performance is just as good as, if not better than, a state-of-the-art model-based algorithm for the MuJoCo benchmark. Moreover, REDQ can achieve this performance using fewer parameters than the model-based method, and with less wall-clock run time. REDQ has three carefully integrated ingredients which allow it to achieve its high performance: (i) a UTD ratio >> 1; (ii) an ensemble of Q functions; (iii) in-target minimization across a random subset of Q functions from the ensemble. Through carefully designed experiments, we provide a detailed analysis of REDQ and related model-free algorithms. To our knowledge, REDQ is the first successful model-free DRL algorithm for continuous-action spaces using a UTD ratio >> 1.
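The three ingredients listed in the abstract can be illustrated with a minimal sketch of the target computation for a single transition. This is not the authors' implementation: the function name `redq_target`, the default values (`subset_size=2`, `gamma=0.99`, `utd_ratio=20`), and the scalar, list-based interface are assumptions chosen for illustration, and any extra terms contributed by the underlying actor-critic method are omitted.

```python
import random


def redq_target(q_next, reward, done, subset_size=2, gamma=0.99, rng=None):
    """Sketch of an in-target minimization over a random subset of an ensemble.

    q_next: list of next-state Q-value estimates, one per ensemble member
            (hypothetical interface; a real agent would use batched tensors).
    Returns the bootstrapped target and the indices of the sampled subset.
    """
    if rng is None:
        rng = random.Random()
    # Ingredient (iii): sample a random subset of the ensemble without replacement.
    subset = rng.sample(range(len(q_next)), subset_size)
    # Take the minimum over the sampled members only, not the whole ensemble.
    min_q = min(q_next[i] for i in subset)
    # Standard one-step bootstrapped target; `done` masks the bootstrap term.
    target = reward + gamma * (1.0 - done) * min_q
    return target, subset


if __name__ == "__main__":
    # Ingredient (ii): an ensemble of Q estimates (made-up numbers).
    ensemble_q_next = [1.2, 0.9, 1.5, 1.1, 1.0]
    # Ingredient (i): a UTD ratio >> 1 means many updates per environment step.
    utd_ratio = 20  # assumed value for illustration
    for _ in range(utd_ratio):
        y, members = redq_target(ensemble_q_next, reward=1.0, done=0.0)
        # In a full agent, every Q function in the ensemble would be
        # regressed toward this same target y.
    print(y, members)
```

Because a fresh subset is drawn for each update, the minimum is taken over different ensemble members every time; the subset size is the knob that controls how pessimistic the target is.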

Xinyue Chen, Che Wang, Zijian Zhou, Keith Ross • 2021

Related benchmarks

Task               | Dataset                                  | Result                        | Rank
Continuous Control | MuJoCo Ant v4                            | Average Return 5.31e+3        | 46
Continuous Control | MuJoCo Walker2d v4                       | --                            | 39
Continuous Control | MuJoCo HalfCheetah v4                    | Average Return 1.15e+4        | 36
Continuous Control | MuJoCo v5                                | Ant Score 4.83e+3             | 15
Continuous Control | DeepMind Control Suite (DMC)             | Cheetah Run 866               | 15
Continuous Control | Gym MuJoCo Hopper v4                     | Average Return 3.30e+3        | 15
Continuous Control | Gym MuJoCo Suite Aggregate               | IQM 1.135                     | 15
Continuous Control | Gym MuJoCo Humanoid v4                   | Average Return 5.28e+3        | 15
Continuous Control | Mujoco                                   | Ant-v5 4.83e+3                | 9
Continuous Control | OpenAI Gym Mujoco 100K steps v2 (train)  | InvertedPendulum-v2 Score 1.00e+3 | 5
