CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity
About
Sample efficiency is a crucial problem in deep reinforcement learning. Recent algorithms, such as REDQ and DroQ, improve sample efficiency by increasing the update-to-data (UTD) ratio to 20 gradient update steps on the critic per environment sample. However, this comes at the expense of a greatly increased computational cost. To reduce this computational burden, we introduce CrossQ, a lightweight algorithm for continuous control tasks that makes careful use of Batch Normalization and removes target networks to surpass the current state of the art in sample efficiency while maintaining a low UTD ratio of 1. Notably, CrossQ does not rely on the advanced bias-reduction schemes used in current methods. CrossQ's contributions are threefold: (1) it matches or surpasses current state-of-the-art methods in terms of sample efficiency, (2) it substantially reduces the computational cost compared to REDQ and DroQ, and (3) it is easy to implement, requiring just a few lines of code on top of SAC.
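To make the "Batch Normalization without target networks" idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a toy critic made of one batch-norm step followed by a linear layer, and it leaves out SAC's entropy term and twin critics for brevity. The names `batch_norm`, `q_values`, and `crossq_style_critic_loss` are hypothetical. The point it illustrates is that the current and next state-action batches are concatenated and passed through the same critic in one forward pass, so the normalization statistics come from both batches together, and the next-step values come from that same network rather than from a separate target network.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature column of `batch` (a list of equal-length rows)
    to zero mean and unit variance, using the statistics of this batch only."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


def q_values(batch, weights):
    """Toy critic (assumption): batch norm followed by a single linear layer."""
    normed = batch_norm(batch)
    return [sum(w * x for w, x in zip(weights, row)) for row in normed]


def crossq_style_critic_loss(sa, next_sa, rewards, weights, gamma=0.99):
    """Mean squared TD error computed with one joint forward pass.

    `sa` and `next_sa` hold concatenated (state, action) feature rows.
    No target network is used: the next-step values come from the same critic.
    """
    joint = sa + next_sa                   # concatenate along the batch dimension
    q_all = q_values(joint, weights)       # batch-norm statistics span both halves
    q, next_q = q_all[: len(sa)], q_all[len(sa):]
    # In an autodiff framework, next_q would be wrapped in a stop-gradient here.
    targets = [r + gamma * nq for r, nq in zip(rewards, next_q)]
    return sum((qi - t) ** 2 for qi, t in zip(q, targets)) / len(sa)


if __name__ == "__main__":
    sa = [[0.0, 1.0], [2.0, 3.0]]          # made-up (state, action) features
    next_sa = [[4.0, 5.0], [6.0, 7.0]]     # made-up (next state, next action)
    print(crossq_style_critic_loss(sa, next_sa, rewards=[1.0, 0.5], weights=[0.3, -0.2]))
```

A real implementation would use a deep network in an autodiff framework and would track running statistics for use at evaluation time. The sketch only shows where the joint batch enters the computation.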
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Continuous Control | MuJoCo Ant v4 | Average Return | 6.98e+3 | 46 |
| Continuous Control | MuJoCo Walker2d v4 | -- | -- | 39 |
| Continuous Control | MuJoCo HalfCheetah v4 | Average Return | 1.29e+4 | 36 |
| Continuous Control | MuJoCo Ant | Average Reward | 4.88e+3 | 26 |
| Continuous Control | Gym MuJoCo Humanoid v4 | Average Return | 1.05e+4 | 15 |
| Continuous Control | Gym MuJoCo Suite Aggregate | IQM | 1.565 | 15 |
| Continuous Control | Gym MuJoCo Hopper v4 | Average Return | 2.47e+3 | 15 |
| Continuous Control | MuJoCo Humanoid | Average Reward | 6.27e+3 | 13 |
| Continuous Control | MuJoCo Walker2d | Max Return | 4.58e+3 | 13 |
| Continuous Control | MuJoCo Hopper | Maximum Average Return | 3.30e+3 | 13 |