A Straightforward Framework For Video Retrieval Using CLIP
About
Video Retrieval is a challenging task where a text query is matched to a video or vice versa. Most of the existing approaches for addressing such a problem rely on annotations made by the users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model, CLIP, to obtain video representations without the need for said annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extended its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.
Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, Hugo Terashima-Marín • 2021
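The abstract does not say here which techniques turn CLIP's image embeddings into a video representation. As a minimal sketch, the example below assumes one simple option: average the per-frame embeddings and compare the result with a text embedding by cosine similarity. The function names (`video_embedding`, `rank_videos`) and the toy 3-dimensional vectors are hypothetical stand-ins for real CLIP encoder outputs, and mean pooling is an assumption rather than a confirmed description of the authors' method.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def video_embedding(frame_embeddings):
    """Mean-pool per-frame embeddings of shape (num_frames, dim) into one unit vector.

    Assumption: frame_embeddings would come from CLIP's image encoder,
    one row per sampled frame.
    """
    frames = l2_normalize(np.asarray(frame_embeddings, dtype=np.float64))
    return l2_normalize(frames.mean(axis=0))

def rank_videos(text_embedding, video_embeddings):
    """Return video indices sorted from best to worst match, plus the raw scores."""
    text = l2_normalize(np.asarray(text_embedding, dtype=np.float64))
    videos = l2_normalize(np.asarray(video_embeddings, dtype=np.float64))
    scores = videos @ text  # cosine similarity of each video to the query
    return np.argsort(-scores), scores

# Toy stand-ins for encoder outputs: two videos with two frames each.
cooking_frames = np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]])
soccer_frames = np.array([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]])
video_matrix = np.stack([video_embedding(cooking_frames),
                         video_embedding(soccer_frames)])

query = np.array([0.0, 1.0, 0.0])  # stand-in for a text embedding
order, scores = rank_videos(query, video_matrix)
print(order[0])  # 1, the second (soccer-like) video ranks first
```

Because both sides are normalized, the same `video_matrix` can serve video-to-text retrieval by swapping the roles of query and candidates.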
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 31.2 | 313 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 | 31.2 | 234 |
| Text-to-Video Retrieval | LSMDC (test) | R@1 | 11.3 | 225 |
| Text-to-Video Retrieval | MSVD | R@1 | 37 | 218 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10 | 64.2 | 211 |
| Text-to-Video Retrieval | MSVD (test) | R@1 | 37 | 204 |
| Video-to-Text Retrieval | MSR-VTT | Recall@1 | 27.2 | 157 |
| Text-to-Video Retrieval | LSMDC | R@1 | 11.3 | 154 |
| Video-to-Text Retrieval | MSVD | R@1 | 59.9 | 93 |
| Video-to-Text Retrieval | MSR-VTT (1k-A) | Recall@5 | 82.5 | 74 |
Showing 10 of 22 rows
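The R@K (Recall@K) figures in the table give the percentage of queries whose correct match appears among the top K retrieved items. The sketch below shows one way to compute it from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, placed on the diagonal; benchmarks with several valid matches per query need a different ground-truth mapping. The helper name `recall_at_k` and the toy scores are illustrative only.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth candidate ranks in the top k.

    similarity[i, j] is the score of query i against candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    sim = np.asarray(similarity, dtype=np.float64)
    correct = np.diag(sim)[:, None]
    # 0-based rank: number of candidates scoring strictly higher than the correct one.
    ranks = (sim > correct).sum(axis=1)
    return 100.0 * float(np.mean(ranks < k))

# Toy similarity matrix for three text queries against three videos.
toy_similarity = np.array([
    [0.9, 0.1, 0.2],  # correct video ranked 1st
    [0.8, 0.3, 0.1],  # correct video ranked 2nd
    [0.2, 0.6, 0.5],  # correct video ranked 2nd
])

r1 = recall_at_k(toy_similarity, 1)
r2 = recall_at_k(toy_similarity, 2)
print(round(r1, 1), round(r2, 1))  # 33.3 100.0
```

Transposing the matrix gives the video-to-text direction under the same one-to-one assumption.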