Hyperspherical Normalization for Scalable Deep Reinforcement Learning
About
Scaling up model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL), because training a model on non-stationary data easily leads to overfitting and unstable optimization. In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norms with hyperspherical normalization; and (ii) using distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using soft actor-critic as the base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains. The code is available at https://dojeon-ai.github.io/SimbaV2.
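To make ingredient (i) concrete, hyperspherical normalization can be pictured as keeping both features and weight rows on the unit hypersphere via L2 normalization, so neither can grow without bound during training. Below is a minimal PyTorch sketch of that idea; `HypersphericalLinear` and `project_weights` are illustrative names, not the actual SimbaV2 implementation (see the linked repository for the real code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HypersphericalLinear(nn.Module):
    """Illustrative linear layer whose input/output features and weight
    rows are L2-normalized onto the unit hypersphere."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Project incoming features onto the unit hypersphere so their
        # norm cannot grow from layer to layer.
        x = F.normalize(x, dim=-1)
        out = self.linear(x)
        # Normalize outputs as well, pinning the feature norm at 1.
        return F.normalize(out, dim=-1)

    @torch.no_grad()
    def project_weights(self) -> None:
        # Re-normalize each weight row so the weight norm stays
        # constant as training proceeds.
        w = self.linear.weight
        w.copy_(F.normalize(w, dim=-1))
```

In this sketch, `project_weights()` would be called on every such layer after each optimizer step, so gradient updates can rotate weight vectors on the sphere but never change their magnitude.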
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Locomotion | Dog & Humanoid suite | IQM | 0.808 | 32 |
| Dexterous Manipulation | MyoSuite | IQM | 0.99 | 28 |
| Humanoid Locomotion and Manipulation | HumanoidBench | IQM | 0.799 | 28 |
| Continuous Control | DeepMind Control (DMC) Suite (500k steps) | IQM | 73 | 8 |
| Continuous Control | Gym MuJoCo | Normalized Reward (TD3) | 1.44 | 8 |
| Continuous Control | DeepMind Control (DMC) Suite | Total Reward | 0.84 | 8 |
| Continuous Control | DeepMind Control (DMC) Suite (100k steps) | IQM | 0.235 | 8 |
| Continuous Control | DeepMind Control (DMC) Suite (200k steps) | IQM | 49.5 | 8 |
| Continuous Control | DeepMind Control (DMC) Suite (1M steps) | IQM | 84.5 | 8 |
| Continuous Control | HumanoidBench No Hand | Total Reward | 380 | 8 |