
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

About

We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in video-and-language learning try to distill spatio-temporal video features and the multi-modal interaction between videos and language from large-scale video-text datasets. In contrast, we leverage a pretrained image-language model and simplify it into a two-stage framework that first co-learns image-text correspondences and then enhances temporal relations between video frames and between video and text, which makes it trainable on comparatively small datasets. Specifically, building on the spatial semantics captured by the Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motion at fine temporal granularity across video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.

Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen • 2021
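
To make the two-stage design concrete, below is a minimal PyTorch sketch of how a CLIP2Video-style retrieval pipeline could be wired together. It is not the authors' released code: it assumes per-frame and per-token CLIP embeddings are already computed, and the TemporalDifferenceBlock and TemporalAlignmentBlock classes are simplified illustrative stand-ins for the blocks named in the abstract.

# Minimal sketch of a CLIP2Video-style retrieval pipeline (illustrative, not the authors' code).
# Inputs are assumed to be pretrained CLIP embeddings: one vector per video frame and
# one vector per text token. Retrieval is scored with a cosine-similarity matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalDifferenceBlock(nn.Module):
    """Injects motion cues by mixing each frame embedding with the
    difference to its neighbouring frame (simplified stand-in)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim) CLIP image embeddings per frame
        diff = frames[:, 1:] - frames[:, :-1]      # frame-to-frame differences
        diff = F.pad(diff, (0, 0, 0, 1))           # pad to keep sequence length
        return frames + self.proj(diff)            # motion-enhanced frame tokens


class TemporalAlignmentBlock(nn.Module):
    """Re-aligns a token sequence (frame tokens or text tokens) onto a small
    set of learned aligned tokens before pooling (simplified stand-in)."""
    def __init__(self, dim: int, num_aligned: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_aligned, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        aligned, _ = self.attn(q, tokens, tokens)        # cross-attend to the tokens
        return F.normalize(aligned.mean(dim=1), dim=-1)  # pooled, unit-norm embedding


def retrieval_similarity(frame_emb, text_emb, tdb, tab):
    """Cosine-similarity matrix between all videos and all captions in a batch.
    Sharing one alignment block across modalities is a simplification here."""
    video = tab(tdb(frame_emb))   # (num_videos, dim)
    text = tab(text_emb)          # (num_texts, dim)
    return video @ text.t()       # (num_videos, num_texts)


if __name__ == "__main__":
    dim, num_frames, num_words = 512, 12, 20
    tdb, tab = TemporalDifferenceBlock(dim), TemporalAlignmentBlock(dim)
    sim = retrieval_similarity(torch.randn(2, num_frames, dim),
                               torch.randn(2, num_words, dim), tdb, tab)
    print(sim.shape)  # torch.Size([2, 2])

The sketch only shows the overall data flow; in the paper the two blocks are more elaborate, and the model is trained end-to-end on top of the CLIP backbone rather than on frozen embeddings.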

Related benchmarks

Task                       Dataset               Metric   Result   Rank
Text-to-Video Retrieval    MSR-VTT (test)        R@1      47.2     234
Text-to-Video Retrieval    MSVD                  R@1      58.7     218
Text-to-Video Retrieval    MSR-VTT (1k-A)        R@10     81.7     211
Text-to-Video Retrieval    MSVD (test)           R@1      47       204
Video-to-Text Retrieval    MSVD                  R@1      58.7     93
Video-to-Text Retrieval    MSR-VTT (1k-A)        R@5      72.3     74
Text-to-Video Retrieval    VATEX (test)          R@1      57.3     62
Video-to-Text Retrieval    MSVD (test)           R@1      58.7     61
Text-to-Video Retrieval    MSR-VTT 1k-A (test)   R@1      47.2     57
Text-to-Video Retrieval    MSR-VTT (Full)        R@1      29.8     55
(Showing 10 of 28 benchmark entries.)

Other info

Code
