An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

About

Masked visual modeling (MVM) has been recently proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies fail to find a truly effective MVM strategy that can largely benefit the downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors leading to effective MVM training, resulting in an enhanced model VIOLETv2. Empirically, we show VIOLETv2 pre-trained with MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering, video captioning, to text-to-video retrieval.

Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu• 2022

Related benchmarks

Task	Dataset	Result
Video Question Answering	MSRVTT-QA	Accuracy44.5	505
Text-to-Video Retrieval	DiDeMo	R@10.479	465
Text-to-Video Retrieval	DiDeMo (test)	R@147.9	407
Video Question Answering	MSVD-QA	Accuracy54.7	393
Video Question Answering	MSRVTT-QA (test)	Accuracy44.5	376
Video Question Answering	MSVD-QA (test)	Accuracy54.7	279
Text-to-Video Retrieval	LSMDC (test)	R@543.5	245
Text-to-Video Retrieval	ActivityNet	R@10.181	245
Video Question Answering	EgoSchema (Full)	Accuracy19.9	241
Video-to-Text retrieval	MSR-VTT	Recall@137.2	221

Showing 10 of 41 rows

Other info

Code

Follow for update

@wizwand_team Discord