VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
About
We present VideoCLIP, a contrastive approach to pre-training a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives obtained via nearest-neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches. Code is available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer • 2021
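The abstract describes a symmetric contrastive objective over paired video and text embeddings. The sketch below is a minimal, hypothetical illustration of such an InfoNCE-style loss in NumPy; it uses only in-batch negatives, whereas VideoCLIP additionally draws hard negatives from nearest-neighbor retrieval, and the function name and temperature value are assumptions for illustration.

```python
import numpy as np

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    Sketch only: the paper also mines hard negatives via nearest-neighbor
    retrieval; here every other in-batch pair serves as a negative.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))              # matching pairs lie on the diagonal

    def nce(lg):
        # Cross-entropy with the matching pair as the target class.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video->text and text->video directions.
    return 0.5 * (nce(logits) + nce(logits.T))
```

With aligned pairs the diagonal dominates and the loss is small; shuffling the pairing (breaking the positives) increases it.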
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | DiDeMo (test) | R@1 | 16.6 | 376 |
| Text-to-Video Retrieval | DiDeMo | R@1 | 0.166 | 360 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 30.9 | 313 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 | 30.9 | 234 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10 | 66.8 | 211 |
| Text-to-Video Retrieval | MSRVTT (test) | Recall@1 | 0.309 | 155 |
| Text-to-Video Retrieval | YouCook2 | Recall@10 | 75 | 117 |
| Text-to-Video Retrieval | MSRVTT | R@1 | 30.9 | 98 |
| Video-to-Text Retrieval | DiDeMo (test) | R@1 | 16.6 | 92 |
| Text-to-Video Retrieval | MSRVTT | R@1 | 30.9 | 75 |
Showing 10 of 71 benchmark rows.