VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
About
Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation -- future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe -- JEPA pretraining followed by action-head fine-tuning -- without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
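The leakage-free prediction idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation. The toy linear encoders, the dimensions, the mean-squared-error loss, and the EMA update of the target encoder are all assumptions chosen for illustration (EMA targets are a common convention in JEPA-style methods, but the abstract does not say how the target encoder is maintained). The only property taken from the text is that the student pathway receives the current observation alone, while future frames pass only through the target encoder to form the supervision target.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM = 32, 8  # hypothetical sizes, chosen only for the demo


def encode(weights, frames):
    """Toy encoder (assumption): linear map plus tanh, (batch, OBS_DIM) -> (batch, LATENT_DIM)."""
    return np.tanh(frames @ weights)


def predict(student_w, predictor_w, current_obs):
    """Student pathway: takes only the current observation as input."""
    return encode(student_w, current_obs) @ predictor_w


def jepa_loss(student_w, predictor_w, target_w, current_obs, future_obs):
    """Latent-space prediction loss (MSE is an assumed choice)."""
    z_predicted = predict(student_w, predictor_w, current_obs)
    # Future frames are encoded only to build the target; they never reach the student.
    # In a real training loop, no gradient would flow through this branch (assumption).
    z_target = encode(target_w, future_obs)
    return float(np.mean((z_predicted - z_target) ** 2))


def ema_update(target_w, student_w, decay=0.99):
    """Assumed target-encoder update: exponential moving average of the student weights."""
    return decay * target_w + (1.0 - decay) * student_w


student_w = rng.normal(scale=0.1, size=(OBS_DIM, LATENT_DIM))
predictor_w = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM))
target_w = student_w.copy()

current_obs = rng.normal(size=(4, OBS_DIM))  # stand-in for current-frame features
future_obs = rng.normal(size=(4, OBS_DIM))   # stand-in for future-frame features

loss = jepa_loss(student_w, predictor_w, target_w, current_obs, future_obs)
target_w = ema_update(target_w, student_w)
print(f"latent prediction loss: {loss:.4f}")
```

Because `predict` has no `future_obs` argument, the separation between input and supervision is enforced by the function signature rather than by convention, which is one simple way to make leakage structurally impossible.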
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Object Achievement | 99.6 | 957 |
| Robotic Manipulation | LIBERO-Plus | Language Understanding Score | 88.1 | 249 |
| Robotic Manipulation | LIBERO v1 (test) | Average Success Rate | 97.2 | 83 |
| Robot Manipulation | SimplerEnv WidowX Robot tasks (test) | Success Rate (Spoon) | 75 | 79 |
| Robot Manipulation | SimplerEnv Google Robot tasks Visual Matching | Pick Coke Can Success Rate | 88.3 | 62 |
| Robot Manipulation | LIBERO-Plus Zero-shot | Camera Score | 64.2 | 42 |
| Robot Manipulation | SimplerEnv WidowX Visual Matching | Average Success Rate | 57.3 | 34 |
| Robotic Manipulation | LIBERO-Plus (test) | Language Robustness Score | 85.4 | 32 |
| Tabletop manipulation | LIBERO | Success Rate | 96.1 | 17 |
| Robot Manipulation | SimplerEnv WidowX (held-out) | Put Spoon on Towel Success Rate | 75 | 14 |