UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
About
With the recent success of pre-training techniques for NLP and image-linguistic tasks, several video-linguistic pre-training works have gradually been developed to improve video-text downstream tasks. However, most existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks. This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components built on the Transformer backbone: two single-modal encoders, a cross encoder, and a decoder. Five objectives, including video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction, are designed to train each of the components. We further develop two pre-training strategies, stage by stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of UniVL more effective. Pre-training is carried out on the sizeable instructional video dataset HowTo100M. Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.
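The four-component flow described above can be sketched as plain data-flow code. This is a minimal illustration under stated assumptions, not the actual implementation: each component here is a single linear map with a nonlinearity standing in for a full Transformer stack, and the hidden size and sequence lengths are arbitrary.

```python
# Hypothetical sketch of UniVL's four-component layout.
# Real components are Transformer encoders/decoders; here each is a
# single linear map so the data flow stays visible and runnable.
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden size (assumption; chosen small for illustration)

def block(x, w):
    """Stand-in for a Transformer component: one linear map + tanh."""
    return np.tanh(x @ w)

# Two single-modal encoders: one for text tokens, one for video frames.
W_text, W_video = rng.standard_normal((D, D)), rng.standard_normal((D, D))
text_feats  = block(rng.standard_normal((5, D)), W_text)   # 5 text tokens
video_feats = block(rng.standard_normal((7, D)), W_video)  # 7 video frames

# Cross encoder: fuses both modalities into one joint sequence
# (here simply concatenation followed by a shared transform).
W_cross = rng.standard_normal((D, D))
fused = block(np.concatenate([text_feats, video_feats], axis=0), W_cross)

# Decoder: generates text conditioned on the fused representation,
# supporting generation objectives such as language reconstruction.
W_dec = rng.standard_normal((D, D))
decoded = block(fused, W_dec)

print(fused.shape)    # (12, 8): 5 text tokens + 7 video frames
print(decoded.shape)  # (12, 8)
```

The point of the unified layout is that understanding tasks (e.g. retrieval) can read off the encoder/cross-encoder outputs, while generation tasks (e.g. captioning) additionally use the decoder, so one pre-trained model serves both families of downstream tasks.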
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Video Retrieval | MSR-VTT | Recall@1: 21.2 | 313 |
| Multimodal Sentiment Analysis | CMU-MOSI (test) | F1: 84.6 | 238 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1: 21.2 | 234 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10: 63.1 | 211 |
| Video Captioning | MSVD | CIDEr: 52.8 | 128 |
| Video Captioning | MSR-VTT (test) | CIDEr: 50.1 | 121 |
| Text-to-Video Retrieval | YouCook2 | Recall@10: 70 | 117 |
| Video Captioning | MSVD (test) | CIDEr: 52.8 | 111 |
| Video Captioning | YouCook2 | METEOR: 22.35 | 104 |
| Video Captioning | MSR-VTT | CIDEr: 50 | 101 |