SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control
About
Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Fall-and-recovery evaluation | Fall-and-recovery sequences: lie-to-stand, prone-to-stand, and stand-to-lie (test) | CR | 42.8 | 8 |
| Humanoid motion tracking | MuJoCo 101 held-out motion sequences (test) | CR (%) | 79.3 | 8 |
| Human-to-robot clip-level retrieval | DPAE (val) | R@1 | 97.8 | 6 |
| Robot-to-robot clip-level retrieval | DPAE (val) | R@1 | 97.2 | 6 |
| Robot-to-human clip-level retrieval | DPAE (val) | R@1 | 97 | 6 |
| Motion Tracking | Diverse Static and Dynamic Motions 2001 sequences | Success Rate | 1.88e+3 | 5 |
| Humanoid motion tracking | MuJoCo evaluation suite (out-of-domain) | E_mpkpe | 227.9 | 4 |
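
The retrieval rows above report R@1 (recall at rank 1). As a rough illustration of how such a score is commonly computed, here is a minimal sketch. It assumes a precomputed query-by-gallery similarity matrix in which the correct match for query `i` is gallery item `i`. The function name `recall_at_k` and the toy matrix are hypothetical and are not taken from the paper or from the DPAE benchmark, whose exact protocol may differ.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose matching item (same index) ranks in the top k.

    Assumption: similarity[i][j] scores query i against gallery item j,
    and the ground-truth match for query i is gallery item i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for query_idx, row in enumerate(similarity):
        # Rank gallery indices by descending similarity to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up values for illustration only).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, k=1))
```

Under this convention a score is usually reported as a percentage, so a table entry such as 97.8 would correspond to a returned value of about 0.978.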