
Learning Invariant Visual Representations for Planning with Joint-Embedding Predictive World Models

About

World models learned from high-dimensional visual observations allow agents to make decisions and plan directly in latent space, avoiding pixel-level reconstruction. However, recent latent predictive architectures (JEPAs), including the DINO world model (DINO-WM), degrade in test-time robustness due to their sensitivity to "slow features": visual variations, such as background changes and distractors, that are irrelevant to the task being solved. We address this limitation by augmenting the predictive objective with a bisimulation encoder that enforces control-relevant state equivalence, mapping states with similar transition dynamics to nearby latent states while limiting contributions from slow features. We evaluate our model on a simple navigation task under different test-time background changes and visual distractors. Across all benchmarks, our model consistently improves robustness to slow features while operating in a reduced latent space, up to 10x smaller than that of DINO-WM. Moreover, our model is agnostic to the choice of pretrained visual encoder and maintains robustness when paired with DINOv2, SimDINOv2, and iBOT features.
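The bisimulation encoder described above maps states with similar transition dynamics to nearby latents. A minimal NumPy sketch of a bisimulation-style objective is given below; this is an illustration of the general technique, not the authors' implementation. The function name, the L1 latent distance, the squared regression loss, and the use of predicted point successors (rather than a transition distribution) are all assumptions:

```python
import numpy as np

def bisimulation_loss(z, r, z_next, gamma=0.99):
    """Illustrative bisimulation loss over a batch of latent states.

    z      : (B, D) current latent states from the encoder
    r      : (B,)   rewards (or another scalar control-relevant signal)
    z_next : (B, D) successor latents from the world model

    The bisimulation-metric recursion says the distance between two
    states should match their reward gap plus the discounted distance
    between their successors; states that are indistinguishable under
    the dynamics are pushed together, regardless of visual appearance.
    """
    # Pairwise L1 distances between current latents: (B, B)
    d = np.abs(z[:, None, :] - z[None, :, :]).sum(-1)
    # Target metric: reward gap + discounted successor distance
    r_gap = np.abs(r[:, None] - r[None, :])
    d_next = np.abs(z_next[:, None, :] - z_next[None, :, :]).sum(-1)
    target = r_gap + gamma * d_next
    # Regress current latent distances onto the target metric
    return np.mean((d - target) ** 2)
```

In practice this term would be added to the JEPA predictive loss, with the target treated as a constant (stop-gradient) so that only the current-state encoder is shaped by the metric.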

Leonardo F. Toso, Davit Shadunts, Yunyang Lu, Nihal Sharma, Donglin Zhan, Nam H. Nguyen, James Anderson • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|------|---------|--------|--------|------|
| PointMaze Navigation | PointMaze Large Color background Change v1 | Success Rate | 86 | 3 |
| PointMaze Navigation | PointMaze Large Color Gradient background change v1 | Success Rate | 78 | 3 |
| PointMaze Navigation | PointMaze Slight background Change v1 | Success Rate | 80 | 3 |
| PointMaze Navigation | PointMaze (C) Color gradient background v1 | Success Rate | 76 | 3 |
| PointMaze Navigation | PointMaze Moving Distractors v1 | Success Rate | 82 | 3 |
| PointMaze Navigation | PointMaze NC v1 | Success Rate | 78 | 3 |
