VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
About
We present VideoCLIP, a contrastive approach to pre-training a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives obtained via nearest-neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches. Code is available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer • 2021
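The abstract describes a symmetric contrastive objective over paired video and text embeddings. The sketch below is a minimal, hypothetical illustration of such an InfoNCE-style loss in NumPy; it uses only in-batch negatives, whereas VideoCLIP additionally draws hard negatives from nearest-neighbor retrieval, and the function name and temperature value are assumptions for illustration.

```python
import numpy as np

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    Sketch only: the paper also mines hard negatives via nearest-neighbor
    retrieval; here every other in-batch pair serves as a negative.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))              # matching pairs lie on the diagonal

    def nce(lg):
        # Cross-entropy with the matching pair as the target class.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video->text and text->video directions.
    return 0.5 * (nce(logits) + nce(logits.T))
```

With aligned pairs the diagonal dominates and the loss is small; shuffling the pairing (breaking the positives) increases it.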
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | DiDeMo (test) | R@1 | 16.6 | 376 |
| Text-to-Video Retrieval | DiDeMo | R@1 | 0.166 | 360 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 30.9 | 313 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 | 30.9 | 234 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10 | 66.8 | 211 |
| Text-to-Video Retrieval | MSRVTT (test) | Recall@1 | 0.309 | 155 |
| Text-to-Video Retrieval | YouCook2 | Recall@10 | 75 | 117 |
| Text-to-Video Retrieval | MSRVTT | R@1 | 30.9 | 98 |
| Video-to-Text Retrieval | DiDeMo (test) | R@1 | 16.6 | 92 |
| Text-to-Video Retrieval | MSRVTT | R@1 | 30.9 | 75 |
Showing 10 of 71 benchmark rows.