Hyperspherical Normalization for Scalable Deep Reinforcement Learning

About

Scaling up model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL), because training on non-stationary data easily leads to overfitting and unstable optimization. In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norms via hyperspherical normalization; and (ii) using distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using soft actor-critic as a base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains. The code is available at https://dojeon-ai.github.io/SimbaV2.
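
To make the first idea concrete, the sketch below shows one way to keep weights and features on the unit hypersphere in a PyTorch-style linear layer. This is an illustrative assumption, not the authors' implementation: the class and function names (`HypersphericalLinear`, `l2_normalize`, `reproject`) are made up here, and details such as learnable scaling are omitted.

```python
# Minimal sketch of hyperspherical normalization (illustrative only; not the
# SimbaV2 authors' code). Weight rows and feature vectors are projected onto
# the unit sphere, so neither can grow without bound during RL training.
import torch
import torch.nn as nn
import torch.nn.functional as F


def l2_normalize(x: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Project vectors onto the unit hypersphere along `dim`."""
    return x / (x.norm(dim=dim, keepdim=True) + eps)


class HypersphericalLinear(nn.Module):
    """Linear layer whose weight rows and input features are kept unit-norm."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        with torch.no_grad():
            self.weight.copy_(l2_normalize(self.weight))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalizing both inputs and weight rows makes the pre-activation a
        # cosine similarity, bounding its magnitude regardless of training time.
        return F.linear(l2_normalize(x), l2_normalize(self.weight))

    @torch.no_grad()
    def reproject(self) -> None:
        # Called after each optimizer step to pull the stored weights back
        # onto the unit sphere, preventing norm growth across updates.
        self.weight.copy_(l2_normalize(self.weight))
```

In a training loop, one would call `layer.reproject()` after every `optimizer.step()` so the stored parameters stay on the sphere rather than drifting outward as gradients accumulate.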

Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, Jaegul Choo • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
---- | ------- | ------ | ------ | ----
Locomotion | Dog & Humanoid suite | IQM | 0.808 | 32
Dexterous Manipulation | MyoSuite | IQM | 0.99 | 28
Humanoid Locomotion and Manipulation | HumanoidBench | IQM | 0.799 | 28
Continuous Control | DeepMind Control (DMC) Suite, 500k steps | IQM | 73 | 8
Continuous Control | Gym MuJoCo | Normalized Reward (TD3) | 1.44 | 8
Continuous Control | DeepMind Control Suite (DMC) | Total Reward | 0.84 | 8
Continuous Control | DeepMind Control (DMC) Suite, 100k steps | IQM | 0.235 | 8
Continuous Control | DeepMind Control (DMC) Suite, 200k steps | IQM | 49.5 | 8
Continuous Control | DeepMind Control (DMC) Suite, 1M steps | IQM | 84.5 | 8
Continuous Control | HumanoidBench No Hand | Total Reward | 380 | 8

(Showing 10 of 11 rows)
