
CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity

About

Sample efficiency is a crucial problem in deep reinforcement learning. Recent algorithms, such as REDQ and DroQ, improve sample efficiency by increasing the update-to-data (UTD) ratio to 20 gradient update steps on the critic per environment sample. However, this comes at the expense of a greatly increased computational cost. To reduce this computational burden, we introduce CrossQ: a lightweight algorithm for continuous control tasks that makes careful use of Batch Normalization and removes target networks to surpass the current state of the art in sample efficiency while maintaining a low UTD ratio of 1. Notably, CrossQ does not rely on the advanced bias-reduction schemes used in current methods. CrossQ's contributions are threefold: (1) it matches or surpasses current state-of-the-art methods in sample efficiency, (2) it substantially reduces the computational cost compared to REDQ and DroQ, and (3) it is easy to implement, requiring just a few lines of code on top of SAC.
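The core change described above can be sketched in a few lines on top of a SAC-style critic update. The following is a minimal illustration, not the authors' implementation: the network sizes and batch shapes are made up, the TD target omits SAC's entropy term for brevity, and plain `BatchNorm1d` stands in for the normalization layer. The two key ingredients from the abstract are shown: Batch Normalization inside the critic, and no target network, with current and next state-action pairs passed through the critic in a single joint forward pass so the normalization statistics cover both distributions.

```python
# Hedged sketch of a CrossQ-style critic update (illustrative, not official code).
import torch
import torch.nn as nn


def make_critic(obs_dim: int, act_dim: int, hidden: int = 256) -> nn.Module:
    # Critic MLP with BatchNorm after each hidden linear layer.
    # Note there is NO separate target copy of this network.
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )


def crossq_critic_loss(critic, obs, act, rew, next_obs, next_act, done,
                       gamma: float = 0.99) -> torch.Tensor:
    # Joint forward pass: concatenate (s, a) and (s', a') along the batch
    # dimension so BatchNorm sees both distributions at once.
    both = torch.cat([torch.cat([obs, act], dim=-1),
                      torch.cat([next_obs, next_act], dim=-1)], dim=0)
    q_both = critic(both)
    q, q_next = q_both.chunk(2, dim=0)
    # Bootstrap target from the SAME critic (no target network); gradients
    # are stopped through the next-state value via detach().
    target = rew + gamma * (1.0 - done) * q_next.detach()
    return ((q - target) ** 2).mean()
```

Usage is the same as a standard SAC critic step: sample a replay batch, draw `next_act` from the current policy, compute the loss, and take one gradient step (UTD ratio 1).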

Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Amiranashvili, Thomas Brox, Jan Peters • 2019

Related benchmarks

Task                Dataset                 Metric                      Result    Rank
Continuous Control  MuJoCo Ant              Average Reward              4.88e+3   26
Continuous Control  MuJoCo Humanoid         Average Reward              6.27e+3   13
Continuous Control  MuJoCo Walker2d         Max Return                  4.58e+3   13
Continuous Control  MuJoCo Hopper           Maximum Average Return      3.30e+3   13
Tractography        ISMRM in silico 2015    VC (%)                      91.64     11
Walking             MyoLeg Walk             Total Actuator Activation   38.47     2
Walking             MS-Human-700-Walk       Total Actuator Activation   356.9     2
