UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
About
With the recent success of pre-training techniques for NLP and image-linguistic tasks, several video-linguistic pre-training works have gradually been developed to improve video-text downstream tasks. However, most existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks. This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components built on the Transformer backbone: two single-modal encoders, a cross encoder, and a decoder. Five objectives, including video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction, are designed to train each of the components. We further develop two pre-training strategies, stage by stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of UniVL more effective. Pre-training is carried out on the sizeable instructional video dataset HowTo100M. Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.
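The four-component flow described above can be sketched as plain data-flow code. This is a minimal illustration under stated assumptions, not the actual implementation: each component here is a single linear map with a nonlinearity standing in for a full Transformer stack, and the hidden size and sequence lengths are arbitrary.

```python
# Hypothetical sketch of UniVL's four-component layout.
# Real components are Transformer encoders/decoders; here each is a
# single linear map so the data flow stays visible and runnable.
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden size (assumption; chosen small for illustration)

def block(x, w):
    """Stand-in for a Transformer component: one linear map + tanh."""
    return np.tanh(x @ w)

# Two single-modal encoders: one for text tokens, one for video frames.
W_text, W_video = rng.standard_normal((D, D)), rng.standard_normal((D, D))
text_feats  = block(rng.standard_normal((5, D)), W_text)   # 5 text tokens
video_feats = block(rng.standard_normal((7, D)), W_video)  # 7 video frames

# Cross encoder: fuses both modalities into one joint sequence
# (here simply concatenation followed by a shared transform).
W_cross = rng.standard_normal((D, D))
fused = block(np.concatenate([text_feats, video_feats], axis=0), W_cross)

# Decoder: generates text conditioned on the fused representation,
# supporting generation objectives such as language reconstruction.
W_dec = rng.standard_normal((D, D))
decoded = block(fused, W_dec)

print(fused.shape)    # (12, 8): 5 text tokens + 7 video frames
print(decoded.shape)  # (12, 8)
```

The point of the unified layout is that understanding tasks (e.g. retrieval) can read off the encoder/cross-encoder outputs, while generation tasks (e.g. captioning) additionally use the decoder, so one pre-trained model serves both families of downstream tasks.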
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Video Retrieval | MSR-VTT | Recall@1: 21.2 | 313 |
| Multimodal Sentiment Analysis | CMU-MOSI (test) | F1: 84.6 | 238 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1: 21.2 | 234 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10: 63.1 | 211 |
| Video Captioning | MSVD | CIDEr: 52.8 | 128 |
| Video Captioning | MSR-VTT (test) | CIDEr: 50.1 | 121 |
| Text-to-Video Retrieval | YouCook2 | Recall@10: 70 | 117 |
| Video Captioning | MSVD (test) | CIDEr: 52.8 | 111 |
| Video Captioning | YouCook2 | METEOR: 22.35 | 104 |
| Video Captioning | MSR-VTT | CIDEr: 50 | 101 |