VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

About

We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding, which reduces the number of decoding operations by 2.85x while maintaining performance similar to that of non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architectural modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets (GQA, TallyQA, POPE, and POPEv2), despite having only 1.6B parameters.
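
To make the embedding-space ideas above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy 2-D vectors, and in particular the change-threshold rule used for selective decoding are assumptions chosen for illustration, and the paper may well use a different criterion. The sketch shows (a) open-vocabulary classification as a nearest-text-embedding lookup under cosine similarity, and (b) one plausible way to call a decoder only when the predicted embedding has drifted away from the last decoded one.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(pred_emb, label_embs):
    """Return the label whose text embedding is closest to the predicted embedding."""
    return max(label_embs, key=lambda name: cosine(pred_emb, label_embs[name]))

def selective_decode_steps(pred_embs, threshold=0.9):
    """Indices at which a text decoder would be invoked.

    Hypothetical rule (an assumption, not taken from the paper): always decode
    the first step, then decode again only when the current embedding's cosine
    similarity to the last decoded embedding falls below `threshold`.
    """
    steps = []
    last = None
    for i, emb in enumerate(pred_embs):
        if last is None or cosine(emb, last) < threshold:
            steps.append(i)
            last = emb
    return steps

# Toy 2-D embeddings, made up purely for illustration.
labels = {"cooking": [1.0, 0.0], "cycling": [0.0, 1.0]}
stream = [[1.0, 0.05], [0.98, 0.1], [0.1, 1.0], [0.05, 0.99]]

print(classify(stream[0], labels))     # nearest label for the first step
print(selective_decode_steps(stream))  # steps where decoding is triggered
```

In this toy stream the predicted embedding changes direction once, so the hypothetical rule triggers the decoder at two of the four steps instead of at every step. The same cosine-similarity lookup would serve retrieval by ranking candidate texts rather than taking only the best one.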

Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, Pascale Fung • 2025

Related benchmarks

Task	Dataset	Result
Visual Question Answering	GQA	Accuracy 61.5	524
Text-to-Video Retrieval	MSR-VTT	Recall@1 40	406
Visual Question Answering	POPE	Accuracy 85.7	136
Video Classification	Kinetics-400	Top-1 Acc 64.8	131
Visual Question Answering	TallyQA	Accuracy 69.9	49
Step Forecasting	COIN	Accuracy 56.2	26
World Modeling	WorldPrediction-WM	Accuracy 65.7	20
Visual Question Answering	POPE v2	Accuracy 86.3	15
Video Classification	SS v2 (test val)	Top-1 Accuracy 73.2	12
Video Classification	EK100 (test val)	Top-1 Acc 44.6	12

Showing 10 of 14 rows
