Learning Invariant Visual Representations for Planning with Joint-Embedding Predictive World Models
About
World models learned from high-dimensional visual observations allow agents to make decisions and plan directly in latent space, avoiding pixel-level reconstruction. However, recent latent predictive architectures (JEPAs), including the DINO world model (DINO-WM), degrade in test-time robustness due to their sensitivity to "slow features": visual variations such as background changes and distractors that are irrelevant to the task being solved. We address this limitation by augmenting the predictive objective with a bisimulation encoder that enforces control-relevant state equivalence, mapping states with similar transition dynamics to nearby latent states while limiting contributions from slow features. We evaluate our model on a simple navigation task under different test-time background changes and visual distractors. Across all benchmarks, our model consistently improves robustness to slow features while operating in a reduced latent space, up to 10x smaller than that of DINO-WM. Moreover, our model is agnostic to the choice of pretrained visual encoder and maintains robustness when paired with DINOv2, SimDINOv2, and iBOT features.
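To illustrate the kind of objective a bisimulation encoder optimizes, here is a minimal NumPy sketch in the style of deep bisimulation for control (DBC). It is an assumption for illustration, not the repository's implementation: the helper names (`bisim_target`, `bisim_loss`), the reward-difference term, and the L1 surrogate for the distance between predicted next-state latents are all choices made here, not taken from this work.

```python
import numpy as np

def bisim_target(r_i, r_j, mu_i, mu_j, gamma=0.99):
    # DBC-style bisimulation distance target between two transitions
    # (assumed form): reward difference plus the discounted L1 distance
    # between the predicted next-state latent means, used as a simple
    # surrogate for the Wasserstein distance between transition models.
    return abs(r_i - r_j) + gamma * np.linalg.norm(mu_i - mu_j, ord=1)

def bisim_loss(z_i, z_j, target):
    # The encoder is trained so that the L1 distance between the two
    # latent states matches the bisimulation target: states with similar
    # dynamics land nearby, regardless of background or distractors.
    return (np.linalg.norm(z_i - z_j, ord=1) - target) ** 2
```

In practice the targets come from sampled transition pairs and the loss is backpropagated through the encoder alongside the JEPA predictive objective; this sketch only shows the per-pair computation.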
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| PointMaze Navigation | PointMaze Large Color background Change v1 | Success Rate | 86 | 3 |
| PointMaze Navigation | PointMaze Large Color Gradient background change v1 | Success Rate | 78 | 3 |
| PointMaze Navigation | PointMaze Slight background Change v1 | Success Rate | 80 | 3 |
| PointMaze Navigation | PointMaze (C) Color gradient background v1 | Success Rate | 76 | 3 |
| PointMaze Navigation | PointMaze Moving Distractors v1 | Success Rate | 82 | 3 |
| PointMaze Navigation | PointMaze NC v1 | Success Rate | 78 | 3 |