
Text-Video Retrieval with Global-Local Semantic Consistent Learning

About

Adapting large-scale image-text pre-training models, e.g., CLIP, to the video domain represents the current state of the art for text-video retrieval. The primary approaches involve transferring text-video pairs to a common embedding space and leveraging cross-modal interactions on specific entities for semantic alignment. Though effective, these paradigms entail prohibitive computational costs, leading to inefficient retrieval. To address this, we propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval. Specifically, we introduce a parameter-free global interaction module to explore coarse-grained alignment. Then, we devise a shared local interaction module that employs several learnable queries to capture latent semantic concepts for learning fine-grained alignment. Furthermore, an Inter-Consistency Loss (ICL) is devised to align the concepts of each visual query and its corresponding textual query, and an Intra-Diversity Loss (IDL) is developed to push apart the queries within each modality (visual or textual) so that they yield more discriminative concepts. Extensive experiments on five widely used benchmarks (i.e., MSR-VTT, MSVD, DiDeMo, LSMDC, and ActivityNet) substantiate the superior effectiveness and efficiency of the proposed method. Remarkably, our method achieves performance comparable to the state of the art (SOTA) while being nearly 220 times faster in terms of computational cost. Code is available at: https://github.com/zchoi/GLSCL.

Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Yihang Duan, Xinyu Lyu, Hengtao Shen• 2024
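
To make the three ingredients named in the abstract more concrete, here is a minimal pure-Python sketch of one plausible reading of them. It is not the authors' implementation (see the linked repository for that). All function names are hypothetical, and several choices are assumptions made only for illustration: mean pooling stands in for the parameter-free global aggregation, the inter-consistency term is taken to be a squared Euclidean distance between paired concept vectors, and the intra-diversity term is a hinge on pairwise cosine similarity with a hypothetical `margin` parameter.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def global_similarity(frame_feats, text_feat):
    """Coarse-grained score with no learnable parameters.

    Assumption: frames are aggregated by plain mean pooling.
    """
    pooled = [sum(col) / len(frame_feats) for col in zip(*frame_feats)]
    return cosine(pooled, text_feat)


def inter_consistency_loss(visual_concepts, text_concepts):
    """Pull the i-th visual concept toward the i-th textual concept.

    Assumption: mean squared Euclidean distance over paired concepts.
    """
    total = 0.0
    for v, t in zip(visual_concepts, text_concepts):
        total += sum((x - y) ** 2 for x, y in zip(v, t))
    return total / len(visual_concepts)


def intra_diversity_loss(concepts, margin=0.0):
    """Push concepts of one modality apart.

    Assumption: hinge penalty on pairwise cosine similarity above `margin`.
    """
    total, pairs = 0.0, 0
    for i in range(len(concepts)):
        for j in range(i + 1, len(concepts)):
            total += max(0.0, cosine(concepts[i], concepts[j]) - margin)
            pairs += 1
    return total / pairs if pairs else 0.0
```

In this reading, the concept vectors would come from the shared local interaction module (the same learnable queries attending over frame or word features), and the two losses would be added to the usual contrastive retrieval objective. How the terms are weighted is left open here because the abstract does not say.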

Related benchmarks

Task | Dataset | Result | Rank
Text-to-Video Retrieval | DiDeMo | R@1 0.491 | 459
Text-to-Video Retrieval | MSRVTT | Recall@1 49.9 | 38
Video-to-Text Retrieval | MSRVTT | R@1 48.3 | 35
Text-to-Video Retrieval | TRECVid V3C1 2019 (tv19) | xinfAP 14.2 | 16
Text-to-Video Retrieval | TRECVid IACC.3 2016 | xinfAP 13.2 | 16
Text-to-Video Retrieval | TRECVid IACC.3 2017 (tv17) | xinfAP 18.5 | 16
Text-to-Video Retrieval | TRECVid IACC.3 2018 (tv18) | xinfAP 7.5 | 16
Text-to-Video Retrieval | TRECVid V3C1 2020 (tv20) | xinfAP 0.206 | 15
Text-to-Video Retrieval | TRECVid V3C2 2022 | xinfAP 12.7 | 13
Text-to-Video Retrieval | TRECVid V3C2 2023 (tv23) | xinfAP 11.9 | 13
Showing 10 of 11 rows
