DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

About

Recent advances in self-supervised visual representation learning have demonstrated the effectiveness of predictive latent-space objectives for learning transferable features. In particular, the Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns representations by predicting latent embeddings of masked target regions from visible context. However, it predicts all target regions in parallel, lacking the ability to order predictions meaningfully. Inspired by human visual perception, which attends selectively and progressively from primary to secondary cues, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges latent predictive and autoregressive self-supervised learning. Specifically, DSeq-JEPA integrates a discriminatively ordered sequential process with a JEPA-style learning objective. This is achieved by (i) identifying primary discriminative regions using an attention-derived saliency map that serves as a proxy for visual importance, and (ii) predicting subsequent regions in discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues during pre-training. Extensive experiments across tasks -- image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB, Stanford Cars), detection/segmentation (MS-COCO, ADE20K), and low-level reasoning (CLEVR) -- show that DSeq-JEPA consistently learns more discriminative and generalizable representations than I-JEPA variants. Project page: https://github.com/SkyShunsuke/DSeq-JEPA.

Xiangteng He, Shunsuke Sakai, Shivam Chandhok, Sara Beery, Kun Yuan, Nicolas Padoy, Tatsuhito Hasegawa, Leonid Sigal • 2025
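
The two steps named in the abstract, ranking regions by saliency and then predicting them one at a time in that order, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `region_size` and `num_primary` parameters, the use of non-overlapping square regions, mean pooling of saliency, and the rule that each predicted region joins the context are all assumptions made for illustration. The saliency map is taken as a given 2D array of per-patch scores.

```python
import numpy as np


def order_regions_by_saliency(saliency, region_size):
    """Rank non-overlapping square regions of a patch-level saliency map.

    saliency: 2D array of per-patch scores (assumed to come from attention).
    region_size: side length of a region in patches (hypothetical parameter).
    Returns a list of (row, col) region indices, most salient first.
    """
    h, w = saliency.shape
    if h % region_size or w % region_size:
        raise ValueError("grid must be divisible by region_size")
    rows, cols = h // region_size, w // region_size
    # Mean saliency per region: reshape into (rows, rs, cols, rs) blocks.
    blocks = saliency.reshape(rows, region_size, cols, region_size)
    scores = blocks.mean(axis=(1, 3))
    # Stable sort on negated scores: descending order, deterministic ties.
    flat_order = np.argsort(-scores, axis=None, kind="stable")
    return [(int(i // cols), int(i % cols)) for i in flat_order]


def sequential_schedule(order, num_primary=1):
    """Yield (context_regions, target_region) pairs in discriminative order.

    The first num_primary regions form the initial context. Each later
    region is predicted in turn and then appended to the context; this
    conditioning rule is an assumption, not taken from the paper.
    """
    context = list(order[:num_primary])
    for target in order[num_primary:]:
        yield list(context), target
        context.append(target)
```

For a 4x4 patch grid with `region_size=2`, `order_regions_by_saliency` returns the four region indices sorted from most to least salient, and `sequential_schedule` then yields three prediction steps, each with a context one region larger than the last. In an actual JEPA-style setup, each step would feed the context embeddings to a predictor and regress the target region's latent embedding.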

Related benchmarks

Task	Dataset	Metric	Value	
Semantic segmentation	ADE20K	mIoU	48.9	1028
Instance Segmentation	MS-COCO	mAP Mask	45.7	123
Fine-grained Image Classification	CUB	Top-1 Acc	68.9	78
Image Classification	ImageNet	Top-1 Acc	87.8	65
Fine-grained Visual Categorization	iNat 21	Top-1 Accuracy	39.7	13
Fine-grained Visual Categorization	Cars	Top-1 Accuracy	70.1	13
Multi-dataset Evaluation	Average (ImageNet, iNat21, CUB, Cars)	Average Top-1 Acc	69.7	12
Object Counting	Clevr Count (test)	Accuracy	87.1	9
Distance Estimation	Clevr Dist (test)	Top-1 Accuracy	71.9	4

