
VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

About

Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation -- future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe -- JEPA pretraining followed by action-head fine-tuning -- without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
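To make the leakage-free prediction idea concrete, below is a minimal PyTorch sketch of a JEPA-style pretraining step under the scheme the abstract describes: the student pathway encodes only the current observation, an EMA target encoder encodes a future frame, and that future latent is used purely as a supervision target for a latent-space loss. All module names, network sizes, the EMA decay, and the omission of language conditioning are illustrative assumptions, not the authors' implementation.

```python
# Sketch of leakage-free latent-state prediction (JEPA-style pretraining).
# Architecture details here are assumptions for illustration only.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps an observation (image) to a latent state vector."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

# Student pathway: sees ONLY the current observation.
student = Encoder()
predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

# Target encoder: an EMA copy of the student. It encodes FUTURE frames, but its
# outputs serve purely as supervision targets and are never fed back as inputs.
target = copy.deepcopy(student)
for p in target.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(predictor.parameters()), lr=1e-4
)

def pretrain_step(obs_t, obs_future, ema_decay=0.996):
    # Predict the latent of the future frame from the current frame alone.
    z_pred = predictor(student(obs_t))
    with torch.no_grad():
        z_future = target(obs_future)       # supervision target only
    loss = F.mse_loss(z_pred, z_future)     # loss in latent space, not pixel space

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # EMA update of the target encoder toward the student.
    with torch.no_grad():
        for p_t, p_s in zip(target.parameters(), student.parameters()):
            p_t.mul_(ema_decay).add_(p_s, alpha=1.0 - ema_decay)
    return loss.item()

# Example: one step on a dummy batch of (current frame, future frame) pairs.
loss = pretrain_step(torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64))
```

In the paper's two-stage recipe, a pretrained encoder of this kind would then be frozen or fine-tuned with an action head on robot data; that second stage is not shown here.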

Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen • 2026

Related benchmarks

Task                 | Dataset                                        | Result                           | Rank
Robot Manipulation   | SimplerEnv WidowX Robot tasks (test)           | Success Rate (Spoon): 75         | 79
Robot Manipulation   | SimplerEnv Google Robot tasks Visual Matching  | Pick Coke Can Success Rate: 88.3 | 62
Robotic Manipulation | LIBERO-Plus                                    | Camera Robustness Score: 63.3    | 34
Robotic Manipulation | LIBERO v1 (test)                               | Config 10 Score: 95.8            | 27

Other info

GitHub
