
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

About

Video-text retrieval plays an essential role in multi-modal research and has been widely used in many real-world web applications. CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of visual concept learning from web-collected image-text datasets. In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. Several questions are investigated via empirical studies: 1) Is the image feature enough for video-text retrieval? 2) How does post-pretraining on a large-scale video-text dataset based on CLIP affect the performance? 3) What is a practical mechanism to model temporal dependency between video frames? 4) What is the hyper-parameter sensitivity of the model on the video-text retrieval task? Extensive experimental results show that the CLIP4Clip model transferred from CLIP achieves SOTA results on various video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo. We release our code at https://github.com/ArrowLuo/CLIP4Clip.
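To make the transfer idea concrete, here is a minimal sketch of the simplest, parameter-free way to turn CLIP's image features into a video representation: encode each sampled frame with CLIP's image encoder, mean-pool the frame embeddings over time, and rank videos by cosine similarity against the text embedding. This is only an illustration under assumptions; the frame sampling, input shapes, and function names are ours, not the authors' code, and the paper additionally studies sequence-aware similarity calculators (e.g., a small transformer over frame features) to address question 3.

```python
# Sketch: CLIP frame features + temporal mean pooling for text-to-video
# retrieval. Assumes frames are already extracted and preprocessed with
# CLIP's `preprocess` transform; uses the openai/CLIP package.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def video_embedding(frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_frames, 3, 224, 224), preprocessed frame batch."""
    with torch.no_grad():
        feats = model.encode_image(frames.to(device))        # (num_frames, d)
        feats = feats / feats.norm(dim=-1, keepdim=True)     # unit-normalize
    return feats.mean(dim=0)  # parameter-free temporal mean pooling -> (d,)

def text_embedding(caption: str) -> torch.Tensor:
    with torch.no_grad():
        tokens = clip.tokenize([caption]).to(device)
        feat = model.encode_text(tokens)[0]
    return feat / feat.norm()

def rank_videos(caption: str, videos: list[torch.Tensor]) -> list[int]:
    """Return candidate video indices sorted by cosine similarity."""
    t = text_embedding(caption)
    v = torch.stack([video_embedding(f) for f in videos])
    v = v / v.norm(dim=-1, keepdim=True)
    scores = v @ t  # cosine similarity, since both sides are normalized
    return scores.argsort(descending=True).tolist()
```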

Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li • 2021

Related benchmarks

Task                     | Dataset         | Metric    | Result | Rank
-------------------------|-----------------|-----------|--------|-----
Text-to-Video Retrieval  | DiDeMo (test)   | R@1       | 47.3   | 376
Text-to-Video Retrieval  | DiDeMo          | R@1       | 0.434  | 360
Text-to-Video Retrieval  | MSR-VTT         | Recall@1  | 46.4   | 313
Text-to-Video Retrieval  | MSR-VTT (test)  | R@1       | 44.5   | 234
Text-to-Video Retrieval  | LSMDC (test)    | R@1       | 24.1   | 225
Text-to-Video Retrieval  | MSVD            | R@1       | 47.3   | 218
Text-to-Video Retrieval  | MSR-VTT (1k-A)  | R@10      | 81.6   | 211
Text-to-Video Retrieval  | MSVD (test)     | R@1       | 49.6   | 204
Text-to-Video Retrieval  | ActivityNet     | R@1       | 0.405  | 197
Video-to-Text Retrieval  | MSR-VTT         | Recall@1  | 45.9   | 157
Showing 10 of 84 rows.
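For readers unfamiliar with the metric in the table, R@K (Recall at K) is the fraction of queries whose ground-truth item appears among the top K ranked candidates. A minimal sketch, assuming the usual retrieval setup where the ground truth for query i is candidate i (names are illustrative):

```python
# Sketch: compute Recall@K from a query-by-candidate similarity matrix.
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """sim[i, j]: similarity of query i to candidate j; ground truth
    for query i is candidate i."""
    # Rank candidates for each query, highest similarity first.
    order = np.argsort(-sim, axis=1)
    # Did the ground-truth candidate land in each query's top k?
    hits = (order[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Example: a diagonal-dominant similarity matrix gives R@1 = 1.0.
sim = np.eye(4) + 0.1 * np.random.rand(4, 4)
print(recall_at_k(sim, 1))
```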

Other info

Code: https://github.com/ArrowLuo/CLIP4Clip
