VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

About

We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding, which reduces the number of decoding operations by 2.85x while maintaining performance similar to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, VL-JEPA's average performance surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets (GQA, TallyQA, POPE, and POPEv2), despite having only 1.6B parameters.
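The open-vocabulary classification mentioned above can be illustrated with a minimal sketch: given a predicted embedding and one text embedding per candidate label, pick the label with the highest cosine similarity. This is only an illustration of nearest-embedding classification in general, not the authors' implementation. The function names `cosine` and `classify` are hypothetical, and the three-dimensional vectors are made-up toy values standing in for the outputs of the model's predictor and text encoder, which are not shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def classify(predicted_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the prediction.

    label_embeddings maps each candidate label to its (hypothetical)
    text embedding; the label set can be changed freely at inference time.
    """
    return max(
        label_embeddings,
        key=lambda label: cosine(predicted_embedding, label_embeddings[label]),
    )


# Toy values for illustration only (assumption: not real model outputs).
label_embeddings = {
    "cooking": [0.9, 0.1, 0.0],
    "cycling": [0.0, 1.0, 0.2],
    "swimming": [0.1, 0.0, 1.0],
}
predicted = [0.8, 0.3, 0.1]

print(classify(predicted, label_embeddings))
```

Because the label set is just a dictionary of embeddings, new classes can be added without retraining or changing the architecture, which is the property the abstract refers to.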

Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, Pascale Fung • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | GQA | Accuracy | 61.5 | 374 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 40 | 313 |
| Video Classification | Kinetics-400 | Top-1 Accuracy | 64.8 | 131 |
| Visual Question Answering | POPE | Accuracy | 85.7 | 71 |
| Visual Question Answering | TallyQA | Accuracy | 69.9 | 29 |
| Step Forecasting | COIN | Accuracy | 56.2 | 22 |
| World Modeling | WorldPrediction-WM | Accuracy | 65.7 | 20 |
| Video Classification | SS v2 (test val) | Top-1 Accuracy | 73.2 | 12 |
| Video Classification | EK100 (test val) | Top-1 Accuracy | 44.6 | 12 |
| Step Recognition | COIN (test) | Top-1 Accuracy | 66.4 | 11 |
Showing 10 of 12 rows
