A Straightforward Framework For Video Retrieval Using CLIP

About

Video retrieval is a challenging task in which a text query is matched to a video, or vice versa. Most existing approaches to this problem rely on annotations made by users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model CLIP to obtain video representations without the need for such annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extended its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.
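As a rough illustration of how an image-text model could be extended to videos, the sketch below averages per-frame embeddings into a single video vector and ranks videos by cosine similarity to a text embedding. The abstract does not specify the aggregation used, so mean pooling is an assumption here. The toy 3-dimensional vectors stand in for real CLIP outputs, and the helper names (`video_embedding`, `rank_videos`) are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def video_embedding(frame_embeddings):
    """Mean-pool per-frame embeddings into one unit-length video vector.

    Mean pooling is an assumed aggregation, not necessarily the paper's.
    """
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    mean = [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_videos(text_embedding, video_embeddings):
    """Return video indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(text_embedding, v) for v in video_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins for frame embeddings of two videos (real ones would come
# from an image encoder applied to sampled frames).
frames_a = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
frames_b = [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
videos = [video_embedding(frames_a), video_embedding(frames_b)]

# Toy stand-in for a text-query embedding in the same space.
query = [0.1, 1.0, 0.0]
ranking = rank_videos(query, videos)
print(ranking)  # → [1, 0]
```

Because both modalities live in one shared space, no video-specific training is needed in this sketch: retrieval reduces to a nearest-neighbour search over the pooled vectors.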

Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, Hugo Terashima-Marín • 2021

Related benchmarks

Task	Dataset	Result
Text-to-Video Retrieval	MSR-VTT	Recall@1 31.2	406
Text-to-Video Retrieval	MSVD	R@1 37	290
Text-to-Video Retrieval	MSR-VTT (test)	R@1 31.2	265
Text-to-Video Retrieval	LSMDC (test)	R@5 22.7	245
Video-to-Text Retrieval	MSR-VTT	Recall@1 27.2	221
Text-to-Video Retrieval	MSR-VTT (1k-A)	R@10 64.2	211
Text-to-Video Retrieval	MSVD (test)	R@1 37	211
Text-to-Video Retrieval	LSMDC	R@1 11.3	181
Video-to-Text Retrieval	MSVD	R@1 59.9	119
Video-to-Text Retrieval	LSMDC	R@1 6.8	92
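
The R@K (Recall@K) values in the table are conventionally the percentage of queries whose correct item appears among the top K retrieved results. A minimal sketch of that computation follows; the function name `recall_at_k` and the toy rankings are illustrative assumptions, not data from the benchmarks.

```python
def recall_at_k(rankings, ground_truth, k):
    """Percentage of queries whose correct item is within the top-k results.

    rankings[i] is the ranked list of candidate indices for query i;
    ground_truth[i] is the index of the correct candidate for query i.
    """
    hits = sum(
        1 for ranked, target in zip(rankings, ground_truth) if target in ranked[:k]
    )
    return 100.0 * hits / len(rankings)


# Toy example: 4 queries, 3 candidates each (made-up data).
rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2], [2, 1, 0]]
ground_truth = [2, 2, 1, 0]

r1 = recall_at_k(rankings, ground_truth, 1)
r2 = recall_at_k(rankings, ground_truth, 2)
r3 = recall_at_k(rankings, ground_truth, 3)
print(r1, r2, r3)  # → 25.0 75.0 100.0
```

Recall@K is non-decreasing in K, which is why R@5 and R@10 figures are not directly comparable with R@1 figures for the same dataset.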

