
DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

About

Recent advances in self-supervised visual representation learning have demonstrated the effectiveness of predictive latent-space objectives for learning transferable features. In particular, Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns representations by predicting latent embeddings of masked target regions from visible context. However, it predicts all target regions in parallel, lacking the ability to order predictions meaningfully. Inspired by human visual perception, which attends selectively and progressively from primary to secondary cues, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges latent predictive and autoregressive self-supervised learning. Specifically, DSeq-JEPA integrates a discriminatively ordered sequential process with a JEPA-style learning objective. This is achieved by (i) identifying primary discriminative regions using an attention-derived saliency map that serves as a proxy for visual importance, and (ii) predicting subsequent regions in discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues in pre-training. Extensive experiments across tasks -- image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB, Stanford Cars), detection/segmentation (MS-COCO, ADE20K), and low-level reasoning (CLEVR) -- show that DSeq-JEPA consistently learns more discriminative and generalizable representations than I-JEPA variants. Project page: https://github.com/SkyShunsuke/DSeq-JEPA.

Xiangteng He, Shunsuke Sakai, Shivam Chandhok, Sara Beery, Kun Yuan, Nicolas Padoy, Tatsuhito Hasegawa, Leonid Sigal • 2025
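
The two steps named in the abstract, ranking regions by saliency and then predicting them in that order, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how regions are formed from the saliency map, so the equal-size grouping below, the function names `order_regions_by_saliency` and `sequential_schedule`, and the toy saliency scores are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def order_regions_by_saliency(
    saliency: Sequence[float], num_regions: int
) -> List[List[int]]:
    """Group patch indices into `num_regions` regions, most salient first.

    Assumption: regions are near-equal-size chunks of the saliency ranking.
    The paper may build regions differently.
    """
    if num_regions <= 0:
        raise ValueError("num_regions must be positive")
    # Rank patch indices by descending saliency; sorted() is stable,
    # so ties keep their original index order.
    ranked = sorted(range(len(saliency)), key=lambda i: -saliency[i])
    # Split the ranking into contiguous chunks of near-equal size.
    base, extra = divmod(len(ranked), num_regions)
    regions: List[List[int]] = []
    start = 0
    for r in range(num_regions):
        size = base + (1 if r < extra else 0)
        regions.append(ranked[start:start + size])
        start += size
    return regions


def sequential_schedule(
    regions: List[List[int]],
) -> List[Tuple[List[int], List[int]]]:
    """Return (context, target) pairs: each step predicts the next region
    from every region already seen, in discriminative order."""
    steps = []
    context = list(regions[0])  # the primary region seeds the context
    for target in regions[1:]:
        steps.append((sorted(context), sorted(target)))
        context.extend(target)  # the predicted region joins the context
    return steps


# Toy example: six patches with hypothetical saliency scores.
toy_saliency = [0.1, 0.9, 0.4, 0.8, 0.2, 0.3]
toy_regions = order_regions_by_saliency(toy_saliency, num_regions=3)
for step, (ctx, tgt) in enumerate(sequential_schedule(toy_regions), start=1):
    print(f"step {step}: context={ctx} -> target={tgt}")
```

In this toy schedule the context grows from the most salient patches outward, which is one way to read the "primary to secondary cues" progression; an I-JEPA-style setup would instead predict all target regions in a single step.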

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Semantic segmentation | ADE20K | mIoU 48.9 | 1024 |
| Image Classification | ImageNet | Top-1 Acc 87.8 | 65 |
| Instance Segmentation | MS-COCO | mAP Mask 45.7 | 60 |
| Fine-grained Image Classification | CUB | Top-1 Acc 68.9 | 35 |
| Fine-grained Visual Categorization | iNat 21 | Top-1 Accuracy 39.7 | 13 |
| Fine-grained Visual Categorization | Cars | Top-1 Accuracy 70.1 | 13 |
| Multi-dataset Evaluation | Average (ImageNet, iNat21, CUB, Cars) | Average Top-1 Acc 69.7 | 12 |
| Object Counting | Clevr Count (test) | Accuracy 87.1 | 9 |
| Distance Estimation | Clevr Dist (test) | Top-1 Accuracy 71.9 | 4 |