Text-Conditional JEPA for Learning Semantically Rich Visual Representations

About

Image-based Joint-Embedding Predictive Architecture (I-JEPA) offers a promising approach to visual self-supervised learning through masked feature prediction. However, given the inherent visual uncertainty at masked positions, feature prediction remains challenging and may fail to learn semantic representations. In this work, we propose Text-Conditional JEPA (TC-JEPA), which uses image captions to reduce the prediction uncertainty. Specifically, we modulate the predicted patch features using a fine-grained text conditioner that computes sparse cross-attention over input text tokens. With such conditioning, patch features become predictable as a function of text and are thus more semantically meaningful. We show that TC-JEPA improves downstream performance and training stability, with promising scaling properties. TC-JEPA also offers a new vision-language pretraining paradigm based on feature prediction only, outperforming contrastive methods on diverse tasks, especially those requiring fine-grained visual understanding and reasoning.

Chen Huang, Xianhang Li, Vimal Thilak, Etai Littwin, Josh Susskind • 2026
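
The conditioning step described in the abstract (predicted patch features modulated by sparse cross-attention over text tokens) can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: sparsity is modeled as top-k token selection per patch, modulation as a residual addition, there are no learned projections, and the names `sparse_cross_attention`, `condition_predictions`, and `k` are hypothetical.

```python
import math


def sparse_cross_attention(patches, tokens, k=2):
    """For each patch query, attend only to its top-k text tokens.

    patches: list of P feature vectors (length D each), used as queries.
    tokens:  list of T text-token vectors (length D each), used as keys and values.
    Returns a list of P text-context vectors (length D each).
    Assumption: top-k selection stands in for the paper's sparse attention.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    contexts = []
    for q in patches:
        # Scaled dot-product score between this patch and every text token.
        scores = [scale * sum(a * b for a, b in zip(q, t)) for t in tokens]
        # Sparsify: keep the indices of the k highest-scoring tokens.
        top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
        # Softmax restricted to the kept tokens (max-subtracted for stability).
        m = max(scores[i] for i in top)
        exps = {i: math.exp(scores[i] - m) for i in top}
        z = sum(exps.values())
        # Weighted sum of the kept token vectors.
        ctx = [sum(exps[i] / z * tokens[i][d] for i in top) for d in range(dim)]
        contexts.append(ctx)
    return contexts


def condition_predictions(predicted, tokens, k=2):
    """Modulate predicted patch features by adding the text context.

    Assumption: a plain residual addition stands in for the modulation.
    """
    contexts = sparse_cross_attention(predicted, tokens, k)
    return [[p + c for p, c in zip(pred, ctx)]
            for pred, ctx in zip(predicted, contexts)]


# Two predicted patch features and three caption-token embeddings (toy values).
predicted = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
conditioned = condition_predictions(predicted, tokens, k=1)
```

With `k=1`, each patch attends only to its single best-matching token, so each predicted feature is shifted toward that token's embedding. A real model would use learned query, key, and value projections and operate on batched tensors.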

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Semantic segmentation	ADE20K	mIoU	58.8	699
Visual Question Answering	VQA v2	Accuracy	57.8	347
Image Classification	CIFAR100	Accuracy	91.6	301
Visual Question Answering	GQA	Accuracy	46.3	218
Semantic segmentation	Pascal VOC	mIoU	83.8	214
Image Classification	iNaturalist 18	Overall Accuracy	54.8	151
Image Classification	ImageNet-1K	Accuracy	82.1	133
Classification	CIFAR100	Accuracy	88.5	90
Object Detection	COCO	AP^b	58	49
Classification	Places 205	Top-1 Acc	59.1	18

