Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control
About
Sample efficiency in Reinforcement Learning (RL) has traditionally been driven by algorithmic enhancements. In this work, we demonstrate that scaling can also lead to substantial improvements. We conduct a thorough investigation into the interplay of scaling model capacity and domain-specific RL enhancements. These empirical findings inform the design choices underlying our proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. BRO achieves state-of-the-art results, significantly outperforming the leading model-based and model-free algorithms across 40 complex tasks from the DeepMind Control, MetaWorld, and MyoSuite benchmarks. BRO is the first model-free algorithm to achieve near-optimal policies in the notoriously challenging Dog and Humanoid tasks.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Continuous Control | MuJoCo Ant v4 | Average Return | 7.03e+3 | 46 |
| Continuous Control | MuJoCo Walker2d v4 | -- | -- | 39 |
| Continuous Control | MuJoCo HalfCheetah v4 | Average Return | 1.37e+4 | 36 |
| Locomotion | Dog & Humanoid suite | IQM | 0.864 | 32 |
| Dexterous Manipulation | MyoSuite | IQM | 0.98 | 28 |
| Humanoid Locomotion and Manipulation | HumanoidBench | IQM | 0.53 | 28 |
| Continuous Control | Gym MuJoCo Suite Aggregate | IQM | 1.071 | 15 |
| Continuous Control | Gym MuJoCo Hopper v4 | Average Return | 2.12e+3 | 15 |
| Continuous Control | Gym MuJoCo Humanoid v4 | Average Return | 4.76e+3 | 15 |
| Continuous Control | DeepMind Control (DMC) Suite (1M steps) | IQM | 84.6 | 14 |
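
Several rows above report IQM (interquartile mean). The sketch below shows one common way to compute it, assuming the usual definition: sort the pooled per-run scores, drop the lowest and highest 25%, and average what remains. The function name `interquartile_mean` and the sample scores are illustrative assumptions, not values or code from the paper or the leaderboard, and the exact aggregation behind the numbers in the table may differ.

```python
import numpy as np


def interquartile_mean(scores):
    """Mean of the middle 50% of scores (illustrative helper, not from the paper)."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    n = flat.size
    if n == 0:
        raise ValueError("scores must contain at least one value")
    # Number of entries dropped from each tail (25% of the sample, rounded down).
    cut = int(0.25 * n)
    return float(flat[cut:n - cut].mean())


# Hypothetical normalized scores pooled over runs and tasks.
example_scores = [0.0, 0.1, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
iqm = interquartile_mean(example_scores)
```

Because both tails are trimmed, a few failed or unusually strong runs move the IQM less than they would move a plain mean, which is why it is often preferred for aggregating results over many tasks and seeds.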