InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

About

Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.
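The abstract describes a joint objective: a global vision-text alignment term plus an instance-level contrastive term. The page does not give the actual loss, so the following is only a minimal sketch of how such a combination could look, assuming a symmetric InfoNCE loss for both terms. The function names, the `temperature` value, and the `instance_weight` parameter are illustrative assumptions, not details from the paper.

```python
import numpy as np


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss for row-aligned pairs (a[i] matches b[i]).

    Assumption: this is a generic contrastive loss, not InstAP's exact objective.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def cross_entropy_diag(z):
        # Numerically stable log-softmax; the correct match sits on the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


def joint_loss(global_vis, global_txt, inst_vis, inst_txt, instance_weight=1.0):
    """Hypothetical joint objective: global term plus weighted instance term.

    global_* hold one embedding per image/video and its scene caption;
    inst_* hold one embedding per grounded region and its instance description.
    """
    global_term = info_nce(global_vis, global_txt)
    instance_term = info_nce(inst_vis, inst_txt)
    return global_term + instance_weight * instance_term
```

In this sketch, setting `instance_weight=0.0` reduces the objective to global-only supervision, which is the setting the abstract contrasts against.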

Ashutosh Kumar, Rajat Saini, Jingjing Pan, Mustafa Erdogan, Mingfang Zhang, Betty Le Dem, Norimasa Kobori, Quan Kong • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Video Retrieval | DiDeMo | R@1 | 0.54 | 459 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 41.1 | 369 |
| Text-to-Video Retrieval | MSVD | R@1 | 49.2 | 264 |
| Text-to-Video Retrieval | ActivityNet | R@1 | 0.507 | 238 |
| Text-to-Video Retrieval | LSMDC | R@1 | 23.5 | 167 |
| Cross-modal retrieval | InstVL img 1K Instance | T2V R@1 | 50.25 | 12 |
| Cross-modal retrieval | InstVL img 1K (Global) | T2V R@1 | 99.2 | 12 |
| Cross-modal retrieval | InstVL img 10K | T2V Recall@1 | 44.05 | 12 |
| Cross-modal retrieval | InstVL img 10K (Global) | T2V Recall@1 | 95.77 | 12 |
| Cross-modal retrieval | InstVL img-zero 1K (Instance) | T2V R@1 | 41.94 | 12 |
Showing 10 of 15 rows
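
The results above are reported as Recall@1 (R@1) for text-to-video retrieval. As a rough illustration of what that metric measures, here is a minimal sketch that computes Recall@K from a query-by-candidate similarity matrix. It assumes that the correct candidate for query `i` is candidate `i`; the function name and that pairing convention are assumptions for illustration, not the evaluation code behind these benchmarks.

```python
import numpy as np


def recall_at_k(similarity, k=1):
    """Fraction of text queries whose ground-truth candidate ranks in the top k.

    similarity[i, j] is the score of text query i against candidate j.
    Assumption: the ground-truth candidate for query i is candidate i.
    """
    similarity = np.asarray(similarity, dtype=float)
    true_scores = np.diag(similarity)[:, None]
    # Rank of the ground truth = number of candidates scoring strictly higher.
    ranks = (similarity > true_scores).sum(axis=1)
    return float(np.mean(ranks < k))


# Three queries; the third ranks a wrong candidate (0.7) above the correct one (0.5).
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.5],
]
r1 = recall_at_k(sim, k=1)  # two of three queries retrieve the correct item first
r2 = recall_at_k(sim, k=2)  # all three have the correct item in the top two
```

The function returns a fraction in [0, 1]; multiply by 100 to express it as a percentage.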
