
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

About

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
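The core training signal described above is a symmetric contrastive objective over paired video and text clips: temporally overlapping pairs are positives, and other pairs in the batch (including retrieved hard negatives) are negatives. A minimal numpy sketch of such a symmetric InfoNCE loss is below; the function name, shapes, and temperature value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    video/text embeddings. Row i of each matrix is assumed to be a
    positive (temporally overlapping) pair; all other rows in the
    batch serve as negatives. Names and temperature are illustrative."""
    # L2-normalize so the dot product is a cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature        # (B, B) similarity matrix
    labels = np.arange(len(logits))       # positives lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the video->text and text->video directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned embeddings (each video closest to its own caption) the loss approaches zero; with random embeddings it stays near log of the batch size.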

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer• 2021

Related benchmarks

Task                      Dataset          Metric  Result  Rank
Text-to-Video Retrieval   DiDeMo (test)    R@1     16.6    376
Text-to-Video Retrieval   DiDeMo           R@1     0.166   360
Text-to-Video Retrieval   MSR-VTT          R@1     30.9    313
Text-to-Video Retrieval   MSR-VTT (test)   R@1     30.9    234
Text-to-Video Retrieval   MSR-VTT (1k-A)   R@10    66.8    211
Text-to-Video Retrieval   MSR-VTT (test)   R@1     0.309   155
Text-to-Video Retrieval   YouCook2         R@10    75      117
Text-to-Video Retrieval   MSR-VTT          R@1     30.9    98
Video-to-Text Retrieval   DiDeMo (test)    R@1     16.6    92
Text-to-Video Retrieval   MSR-VTT          R@1     30.9    75

Showing 10 of 71 rows.
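The benchmark results above report Recall@K (R@1, R@10): the fraction of queries for which the ground-truth item appears among the top K retrieved candidates. A short sketch of how this metric is computed from a query-by-candidate similarity matrix, assuming query i's ground truth is candidate i:

```python
import numpy as np

def recall_at_k(similarity, k):
    """Recall@K for retrieval evaluation. `similarity` is a
    (num_queries, num_candidates) score matrix; the ground-truth
    candidate for query i is assumed to be index i."""
    # rank candidates per query by descending similarity, keep top K
    topk = np.argsort(-similarity, axis=1)[:, :k]
    # a hit: the ground-truth index appears among the top K
    hits = (topk == np.arange(len(similarity))[:, None]).any(axis=1)
    return hits.mean()
```

For example, if one of two queries ranks its ground-truth clip first and the other ranks it second, R@1 is 0.5 and R@2 is 1.0.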

Other info

Code: https://github.com/pytorch/fairseq/tree/main/examples/MMPT