VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

About

Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation -- future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe -- JEPA pretraining followed by action-head fine-tuning -- without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.

Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen• 2026

Related benchmarks

Task	Dataset	Result
Robot Manipulation	LIBERO	Object Achievement99.6	957
Robotic Manipulation	LIBERO-Plus	Language Understanding Score88.1	249
Robotic Manipulation	LIBERO v1 (test)	Average Success Rate97.2	83
Robot Manipulation	SimplerEnv WidowX Robot tasks (test)	Success Rate (Spoon)75	79
Robot Manipulation	SimplerEnv Google Robot tasks Visual Matching	Pick Coke Can Success Rate88.3	62
Robot Manipulation	LIBERO-Plus Zero-shot	Camera Score64.2	42
Robot Manipulation	SimplerEnv WidowX Visual Matching	Average Success Rate57.3	34
Robotic Manipulation	LIBERO-Plus (test)	Language Robustness Score85.4	32
Tabletop manipulation	LIBERO	Success Rate96.1	17
Robot Manipulation	SimplerEnv WidowX (held-out)	Put Spoon on Towel Success Rate75	14

Showing 10 of 10 rows

Other info

GitHub

Follow for update

@wizwand_team Discord