Evaluating Factor-Wise Auxiliary Dynamics Supervision for Latent Structure and Robustness in Simulated Humanoid Locomotion
About
We evaluate whether factor-wise auxiliary dynamics supervision produces useful latent structure or improved robustness in simulated humanoid locomotion. DynaMITE -- a transformer encoder with a factored 24-d latent trained with per-factor auxiliary losses during proximal policy optimization (PPO) -- is compared against Long Short-Term Memory (LSTM), plain Transformer, and Multilayer Perceptron (MLP) baselines on a Unitree G1 humanoid across four Isaac Lab tasks. The supervised latent shows no evidence of decodable or functionally separable factor structure: probe R^2 is near zero for all five dynamics factors, clamping any factor's latent subspace changes reward by < 0.05, and standard disentanglement metrics (MIG, DCI, SAP) are near zero. An unsupervised LSTM hidden state achieves higher probe R^2 (up to 0.10). A 2x2 factorial ablation (n = 10 seeds) isolates the contributions of the tanh bottleneck and the auxiliary losses: the auxiliary losses show no measurable effect on either in-distribution (ID) reward (+0.03, p = 0.732) or severe out-of-distribution (OOD) reward (+0.03, p = 0.669), while the bottleneck shows a small, consistent advantage in both regimes (ID: +0.16, p = 0.207; OOD: +0.10, p = 0.208). The bottleneck advantage persists under severe combined perturbation but does not amplify, indicating a training-time representation benefit rather than a robustness mechanism. LSTM achieves the best nominal reward on all four tasks (p < 0.03); DynaMITE degrades less under combined-shift stress (2.3% vs. 16.7%), but this difference is attributable to the bottleneck compression, not the auxiliary supervision. For locomotion practitioners: auxiliary dynamics supervision does not produce an interpretable estimator and does not measurably improve reward or robustness beyond what the bottleneck alone provides; recurrent baselines remain the stronger choice for nominal performance.
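The "probe R^2" protocol referenced above amounts to fitting a held-out linear regression from latent states to each dynamics factor and scoring it with R^2. A minimal NumPy sketch on synthetic data (the 24-d latent, sample count, and factor construction here are illustrative assumptions, not the paper's rollout data):

```python
import numpy as np

def probe_r2(latents, factor, train_frac=0.75, seed=0):
    """Least-squares linear probe from latents to a scalar factor.

    Returns held-out R^2; values near zero (or negative) indicate the
    factor is not linearly decodable from the latent.
    """
    rng = np.random.default_rng(seed)
    n = len(latents)
    idx = rng.permutation(n)
    n_tr = int(train_frac * n)
    tr, te = idx[:n_tr], idx[n_tr:]
    Z = np.column_stack([latents, np.ones(n)])  # append bias column
    w, *_ = np.linalg.lstsq(Z[tr], factor[tr], rcond=None)
    pred = Z[te] @ w
    ss_res = np.sum((factor[te] - pred) ** 2)
    ss_tot = np.sum((factor[te] - factor[te].mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins for rollout latents and dynamics factors.
rng = np.random.default_rng(1)
z = rng.normal(size=(2000, 24))        # 24-d latent states
decodable = z @ rng.normal(size=24)    # factor linearly present in z
opaque = rng.normal(size=2000)         # factor absent from z

print(probe_r2(z, decodable))  # close to 1.0
print(probe_r2(z, opaque))     # near 0 (may be slightly negative)
```

In this framing, "probe R^2 ~ 0" for all five factors means every supervised factor behaves like `opaque` here, while the LSTM hidden state's 0.10 sits only marginally above chance.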
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Command tracking recovery after push | Flat terrain task | Recovery Steps | 5.6 | 28 |
| Humanoid Locomotion | Humanoid Randomized Task (OOD Sweep) | Reward | -4.37 | 24 |
| Push recovery | Flat task push-recovery protocol | Peak Tracking Error | 3.96 | 14 |
| Velocity tracking | Combined-shift Level 3 | Tracking Error | 4.23 | 4 |
| Velocity tracking | Combined-shift Level 4 | Velocity Tracking Error | 6.13 | 4 |
| Humanoid Locomotion | Flat In-distribution (deterministic evaluation) | Cumulative Reward | 4.88 | 4 |
| Humanoid Locomotion | Terrain In-distribution (deterministic evaluation) | Cumulative Reward | 4.49 | 4 |
| Velocity tracking | Combined-shift Level 1 | Tracking Error | 2.56 | 4 |
| Humanoid Locomotion | Push In-distribution (deterministic evaluation) | Cumulative Reward | 4.60 | 4 |
| Humanoid Locomotion | Randomized In-distribution (deterministic evaluation) | Cumulative Reward | 4.48 | 4 |