VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

About

We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both for a variety of end tasks. Existing pre-training are task-specific by adopting either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pretraining masking schemes that better mix across modalities (e.g. by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g. unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous methods, often outperforming task-specific pre-training. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.

Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer• 2021

Related benchmarks

Task	Dataset	Result
Text-to-Video Retrieval	MSR-VTT	Recall@128.1	406
Text-to-Video Retrieval	MSR-VTT (test)	R@128.1	265
Text-to-Video Retrieval	YouCook2	Recall@1069.4	117
Video Captioning	YouCook2	METEOR18.22	108
Video Captioning	YouCook II (val)	CIDEr138.7	98
Text-to-Video Retrieval	Youcook2 (test)	Recall@1069.4	59
Text-to-Video Retrieval	MSR-VTT 1k-A (test)	R@128.1	57
Video Question Answering	MSR-VTT	Accuracy91.64	42
Action Step Localization	CrossTask (test)	Recall46.5	32
Action Segmentation	COIN	Frame Accuracy68.39	29

Showing 10 of 12 rows

Other info

Code

Follow for update

@wizwand_team Discord