VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

About

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer • 2021
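
The contrastive objective described in the abstract can be illustrated with a small, generic sketch: a symmetric (video-to-text plus text-to-video) InfoNCE-style loss over a batch of paired embeddings. This is not the authors' implementation. The function names, the dot-product similarity, and the `temperature` parameter are assumptions made for illustration, and the sketch only treats the other items in the batch as negatives. It does not implement the temporally overlapping pair sampling or the nearest-neighbor retrieval that the abstract uses to build hard negatives.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(scores, idx):
    """Log-softmax of `scores` evaluated at position `idx` (numerically stable)."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[idx] - lse


def symmetric_contrastive_loss(video_emb, text_emb, temperature=1.0):
    """Symmetric InfoNCE-style loss; pair i is (video_emb[i], text_emb[i]).

    `temperature` is a hypothetical parameter, not taken from the source.
    """
    n = len(video_emb)
    # sim[i][j]: scaled similarity between video i and text j.
    sim = [[dot(v, t) / temperature for t in text_emb] for v in video_emb]
    # Video-to-text: each video should score its own text highest in its row.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-video: each text should score its own video highest in its column.
    t2v = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return v2t + t2v


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    # Correctly paired embeddings give a lower loss than mismatched ones.
    print(symmetric_contrastive_loss(videos, aligned))
    print(symmetric_contrastive_loss(videos, swapped))
```

In a setup like the one the abstract describes, the hard negatives would enter through how batches are assembled (for example, grouping clips retrieved as nearest neighbors), so the loss itself could stay in this in-batch form.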

Related benchmarks

Task                     Dataset         Result          Rank
Text-to-Video Retrieval  DiDeMo          R@1 0.166       465
Text-to-Video Retrieval  DiDeMo (test)   R@1 16.6        407
Text-to-Video Retrieval  MSR-VTT         Recall@1 30.9   406
Text-to-Video Retrieval  MSR-VTT (test)  R@1 30.9        265
Text-to-Video Retrieval  MSR-VTT (1k-A)  R@10 66.8       211
Text-to-Video Retrieval  MSRVTT (test)   Recall@5 0.554  178
Text-to-Video Retrieval  MSRVTT          R@1 30.9        144
Video-to-Text retrieval  DiDeMo          R@1 16.6        136
Text-to-Video Retrieval  YouCook2        Recall@10 75    117
Video-to-Text retrieval  DiDeMo (test)   R@1 16.6        111
Showing 10 of 76 rows
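
The R@K / Recall@K figures in the table are retrieval metrics: the fraction of queries whose correct item appears among the top K results. A minimal sketch of that computation follows, assuming a square similarity matrix in which query i's correct match is item i. The function name and input layout are illustrative assumptions, not part of the VideoCLIP code. The function returns a fraction (such as 0.166); multiplying by 100 gives the percentage form (such as 16.6) that other rows of the table use.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose correct item ranks in the top k.

    sim[i][j] is the similarity of query i to candidate j; the correct
    candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank: how many candidates score strictly higher
        # than the correct one.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


if __name__ == "__main__":
    # Three text queries scored against three videos (toy numbers).
    sim = [
        [0.9, 0.1, 0.0],  # correct video ranked first
        [0.8, 0.5, 0.1],  # correct video ranked second
        [0.2, 0.3, 0.1],  # correct video ranked third
    ]
    for k in (1, 2, 3):
        print(k, recall_at_k(sim, k))
```

Ties are counted in the query's favor here, since only strictly higher scores push the correct item down; a stricter evaluation could break ties differently.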
