Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control

About

Sample efficiency in Reinforcement Learning (RL) has traditionally been driven by algorithmic enhancements. In this work, we demonstrate that scaling can also lead to substantial improvements. We conduct a thorough investigation into the interplay of scaling model capacity and domain-specific RL enhancements. These empirical findings inform the design choices underlying our proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. BRO achieves state-of-the-art results, significantly outperforming the leading model-based and model-free algorithms across 40 complex tasks from the DeepMind Control, MetaWorld, and MyoSuite benchmarks. BRO is the first model-free algorithm to achieve near-optimal policies in the notoriously challenging Dog and Humanoid tasks.

Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, Marek Cygan • 2024

Related benchmarks

Task	Dataset	Result
Continuous Control	MuJoCo Ant v4	Average Return 7.03e+3	46
Continuous Control	MuJoCo Walker2d v4	--	39
Continuous Control	MuJoCo HalfCheetah v4	Average Return 1.37e+4	36
Locomotion	Dog & Humanoid suite	IQM 0.864	32
Dexterous Manipulation	MyoSuite	IQM 0.98	28
Humanoid Locomotion and Manipulation	HumanoidBench	IQM 0.53	28
Continuous Control	Gym MuJoCo Suite Aggregate	IQM 1.071	15
Continuous Control	Gym MuJoCo Hopper v4	Average Return 2.12e+3	15
Continuous Control	Gym MuJoCo Humanoid v4	Average Return 4.76e+3	15
Continuous Control	DeepMind Control (DMC) Suite (1M steps)	IQM 84.6	14

Showing 10 of 15 rows
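
Several rows above report IQM. The page does not say how it is computed, so the sketch below assumes it is the interquartile mean commonly used in RL evaluation: sort the per-run scores, discard the bottom and top 25%, and average the rest. The function name and the example scores are hypothetical and are not taken from the benchmark results.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of scores (lowest and highest quarters dropped)."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one score")
    # Assumption: drop floor(n / 4) scores from each end, so samples whose
    # size is not a multiple of 4 keep slightly more than half the data.
    cut = n // 4
    middle = ordered[cut:n - cut]
    return sum(middle) / len(middle)


# Hypothetical per-run scores with one extreme outlier.
example_scores = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]
print(interquartile_mean(example_scores))  # 3.5
```

The plain mean of the same scores is 15.125, which shows why a trimmed statistic is often preferred when a few runs are unusually good or bad.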
