Learning Invariant Visual Representations for Planning with Joint-Embedding Predictive World Models
About
World models learned from high-dimensional visual observations allow agents to make decisions and plan directly in latent space, avoiding pixel-level reconstruction. However, recent latent predictive architectures (JEPAs), including the DINO world model (DINO-WM), degrade in test-time robustness due to their sensitivity to "slow features": visual variations such as background changes and distractors that are irrelevant to the task being solved. We address this limitation by augmenting the predictive objective with a bisimulation encoder that enforces control-relevant state equivalence, mapping states with similar transition dynamics to nearby latent states while limiting contributions from slow features. We evaluate our model on a simple navigation task under different test-time background changes and visual distractors. Across all benchmarks, our model consistently improves robustness to slow features while operating in a reduced latent space, up to 10x smaller than that of DINO-WM. Moreover, our model is agnostic to the choice of pretrained visual encoder and maintains robustness when paired with DINOv2, SimDINOv2, and iBOT features.
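To illustrate the kind of objective a bisimulation encoder optimizes, here is a minimal NumPy sketch in the style of deep bisimulation for control (DBC). It is an assumption for illustration, not the repository's implementation: the helper names (`bisim_target`, `bisim_loss`), the reward-difference term, and the L1 surrogate for the distance between predicted next-state latents are all choices made here, not taken from this work.

```python
import numpy as np

def bisim_target(r_i, r_j, mu_i, mu_j, gamma=0.99):
    # DBC-style bisimulation distance target between two transitions
    # (assumed form): reward difference plus the discounted L1 distance
    # between the predicted next-state latent means, used as a simple
    # surrogate for the Wasserstein distance between transition models.
    return abs(r_i - r_j) + gamma * np.linalg.norm(mu_i - mu_j, ord=1)

def bisim_loss(z_i, z_j, target):
    # The encoder is trained so that the L1 distance between the two
    # latent states matches the bisimulation target: states with similar
    # dynamics land nearby, regardless of background or distractors.
    return (np.linalg.norm(z_i - z_j, ord=1) - target) ** 2
```

In practice the targets come from sampled transition pairs and the loss is backpropagated through the encoder alongside the JEPA predictive objective; this sketch only shows the per-pair computation.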
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| PointMaze Navigation | PointMaze Large Color background Change v1 | Success Rate | 86 | 3 |
| PointMaze Navigation | PointMaze Large Color Gradient background change v1 | Success Rate | 78 | 3 |
| PointMaze Navigation | PointMaze Slight background Change v1 | Success Rate | 80 | 3 |
| PointMaze Navigation | PointMaze (C) Color gradient background v1 | Success Rate | 76 | 3 |
| PointMaze Navigation | PointMaze Moving Distractors v1 | Success Rate | 82 | 3 |
| PointMaze Navigation | PointMaze NC v1 | Success Rate | 78 | 3 |