CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity

About

Sample efficiency is a crucial problem in deep reinforcement learning. Recent algorithms, such as REDQ and DroQ, found a way to improve the sample efficiency by increasing the update-to-data (UTD) ratio to 20 gradient update steps on the critic per environment sample. However, this comes at the expense of a greatly increased computational cost. To reduce this computational burden, we introduce CrossQ: A lightweight algorithm for continuous control tasks that makes careful use of Batch Normalization and removes target networks to surpass the current state-of-the-art in sample efficiency while maintaining a low UTD ratio of 1. Notably, CrossQ does not rely on advanced bias-reduction schemes used in current methods. CrossQ's contributions are threefold: (1) it matches or surpasses current state-of-the-art methods in terms of sample efficiency, (2) it substantially reduces the computational cost compared to REDQ and DroQ, (3) it is easy to implement, requiring just a few lines of code on top of SAC.
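The abstract names two ingredients, Batch Normalization in the critic and the removal of target networks, without spelling out how they fit together. The sketch below is one plausible reading, not the authors' implementation: the critic sees the current and next state-action batches in a single concatenated pass so that both share the same normalization statistics, and the bootstrapped target comes from that same critic. The joint pass is an assumption drawn from outside this abstract. The toy linear critic and the names `batch_norm`, `critic`, `td_targets`, `weights`, and `gamma` are hypothetical, and SAC's entropy term is left out for brevity.

```python
import math

def batch_norm(rows, eps=1e-5):
    """Normalize each feature column to zero mean and unit variance over the batch."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(dims)]
    return [[(r[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
            for r in rows]

def critic(rows, weights):
    """Toy linear critic (hypothetical): batch-normalize inputs, then take a weighted sum."""
    return [sum(w * x for w, x in zip(weights, row)) for row in batch_norm(rows)]

def td_targets(sa, next_sa, rewards, weights, gamma=0.99):
    """Bootstrapped targets computed with the same critic, with no target network."""
    # Assumption: one pass over the concatenated batch, so the normalization
    # statistics cover both current and next state-action pairs.
    joint_q = critic(sa + next_sa, weights)
    q, next_q = joint_q[:len(sa)], joint_q[len(sa):]
    # In an autodiff framework, next_q would be wrapped in a stop-gradient here.
    targets = [r + gamma * nq for r, nq in zip(rewards, next_q)]
    return q, targets

# Made-up illustrative data: two transitions with two input features each.
sa = [[0.0, 1.0], [1.0, 0.0]]
next_sa = [[2.0, 3.0], [3.0, 2.0]]
rewards = [1.0, 0.5]
weights = [0.5, -0.25]
q, targets = td_targets(sa, next_sa, rewards, weights)
```

With a UTD ratio of 1, a step like this would run once per environment sample, compared with the 20 critic updates per sample that the abstract attributes to REDQ and DroQ.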

Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Amiranashvili, Thomas Brox, Jan Peters • 2019

Related benchmarks

Task	Dataset	Result
Continuous Control	MuJoCo Ant v4	Average Return 6.98e+3	46
Continuous Control	MuJoCo Walker2d v4	--	39
Continuous Control	MuJoCo HalfCheetah v4	Average Return 1.29e+4	36
Continuous Control	MuJoCo Ant	Average Reward 4.88e+3	26
Continuous Control	Gym MuJoCo Humanoid v4	Average Return 1.05e+4	15
Continuous Control	Gym MuJoCo Suite Aggregate	IQM 1.565	15
Continuous Control	Gym MuJoCo Hopper v4	Average Return 2.47e+3	15
Continuous Control	MuJoCo Humanoid	Average Reward 6.27e+3	13
Continuous Control	MuJoCo Walker2d	Max Return 4.58e+3	13
Continuous Control	MuJoCo Hopper	Maximum Average Return 3.30e+3	13

Showing 10 of 15 rows
