
A Straightforward Framework For Video Retrieval Using CLIP

About

Video Retrieval is a challenging task where a text query is matched to a video or vice versa. Most of the existing approaches for addressing such a problem rely on annotations made by the users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model, CLIP, to obtain video representations without the need for said annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extended its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.

Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, Hugo Terashima-Marín • 2021
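The abstract does not spell out which techniques are used to extend CLIP from images to videos. As a rough illustration only, the sketch below assumes one simple option: average the per-frame image embeddings into a single video embedding, then rank videos by cosine similarity to the text embedding. The function names (`mean_pool`, `rank_videos`) and the 3-dimensional toy vectors are hypothetical stand-ins for real CLIP outputs, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video embedding (assumed strategy)."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_videos(text_embedding, video_embeddings):
    """Return video indices sorted from most to least similar to the query."""
    scores = [cosine(text_embedding, v) for v in video_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for CLIP image and text embeddings (hypothetical data).
video_a_frames = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
video_b_frames = [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
videos = [mean_pool(video_a_frames), mean_pool(video_b_frames)]

query = [0.1, 1.0, 0.0]  # pretend text embedding, closest to video B
ranking = rank_videos(query, videos)
print(ranking)
```

In a real pipeline the toy vectors would be replaced by embeddings produced by a CLIP image encoder (one per sampled frame) and a CLIP text encoder. No annotations are needed at any step, which is the point the abstract makes.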

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Video Retrieval | MSR-VTT | Recall@1 31.2 | 313 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 31.2 | 234 |
| Text-to-Video Retrieval | LSMDC (test) | R@1 11.3 | 225 |
| Text-to-Video Retrieval | MSVD | R@1 37 | 218 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10 64.2 | 211 |
| Text-to-Video Retrieval | MSVD (test) | R@1 37 | 204 |
| Video-to-Text retrieval | MSR-VTT | Recall@1 27.2 | 157 |
| Text-to-Video Retrieval | LSMDC | R@1 11.3 | 154 |
| Video-to-Text retrieval | MSVD | R@1 59.9 | 93 |
| Video-to-Text retrieval | MSR-VTT (1k-A) | Recall@5 82.5 | 74 |
Showing 10 of 22 rows
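
The R@K (Recall@K) values above are the percentage of queries whose correct match appears among the top K retrieved items. A minimal sketch of that computation follows, assuming exactly one correct item per query. The helper `recall_at_k` and the toy rankings are hypothetical and are not taken from the paper's evaluation code.

```python
def recall_at_k(rankings, ground_truth, k):
    """Percentage of queries whose correct item is within the top-k results.

    rankings:     one list per query, item indices ordered best-first
    ground_truth: the index of the correct item for each query
    """
    hits = sum(1 for ranked, gt in zip(rankings, ground_truth) if gt in ranked[:k])
    return 100.0 * hits / len(rankings)


# Four toy queries over a gallery of three items (hypothetical data).
rankings = [
    [2, 0, 1],
    [1, 2, 0],
    [0, 1, 2],
    [0, 2, 1],
]
ground_truth = [2, 2, 0, 1]

r_at_1 = recall_at_k(rankings, ground_truth, 1)
r_at_2 = recall_at_k(rankings, ground_truth, 2)
r_at_3 = recall_at_k(rankings, ground_truth, 3)
print(r_at_1, r_at_2, r_at_3)
```

The same function covers both directions: for text-to-video the queries are captions and the gallery holds videos, and for video-to-text the roles are swapped.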
