Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
About
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks. Previous studies of video-language pre-training mainly focus on short-form videos (i.e., within 30 seconds) and sentences, leaving long-form video-language pre-training rarely explored. Directly learning representations from long-form videos and language may benefit many long-form video-language understanding tasks. However, it is challenging due to the difficulty of modeling long-range relationships and the heavy computational burden caused by more frames. In this paper, we introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from an existing public dataset. To effectively capture the rich temporal dynamics and to better align video and language in an efficient end-to-end manner, we introduce two novel designs in our LF-VILA model. First, we propose a Multimodal Temporal Contrastive (MTC) loss that learns the temporal relation across different modalities by encouraging fine-grained alignment between long-form videos and paragraphs. Second, we propose a Hierarchical Temporal Window Attention (HTWA) mechanism to effectively capture long-range dependencies while reducing the computational cost of the Transformer. We fine-tune the pre-trained LF-VILA model on seven downstream long-form video-language understanding tasks covering paragraph-to-video retrieval and long-form video question answering, and achieve new state-of-the-art performance. Specifically, our model achieves a 16.1% relative improvement on the ActivityNet paragraph-to-video retrieval task and a 2.4% relative improvement on the How2QA task. We release our code, dataset, and pre-trained models at https://github.com/microsoft/XPretrain.
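The two designs above can be sketched in isolation. Below is a minimal NumPy illustration, not the released implementation: `mtc_loss` is a symmetric InfoNCE-style contrastive objective over matched clip/sentence embedding pairs (the fine-grained video-paragraph alignment idea behind MTC), and `temporal_window_attention` restricts self-attention to non-overlapping temporal windows, the building block that HTWA stacks with growing window sizes to cut the quadratic attention cost. All function names and the temperature value are illustrative assumptions.

```python
import numpy as np

def mtc_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over N matched
    clip/sentence embedding pairs, each of shape (N, D).
    Illustrative sketch; not the paper's exact formulation."""
    v = video_emb / np.linalg.norm(video_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = v @ t.T / temperature  # (N, N) cosine-similarity matrix

    def xent_diag(l):
        # Cross-entropy with the matched (diagonal) pair as the target.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))


def temporal_window_attention(x, window):
    """Single-head self-attention restricted to non-overlapping temporal
    windows over frame features x of shape (T, D). Each frame attends
    only within its window, reducing cost from O(T^2) to O(T * window)."""
    T, D = x.shape
    out = np.empty_like(x)
    for start in range(0, T, window):
        block = x[start:start + window]
        scores = block @ block.T / np.sqrt(D)
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out[start:start + window] = weights @ block
    return out
```

Stacking `temporal_window_attention` layers with progressively larger `window` values gives each frame an increasingly large effective receptive field, which is the intuition behind the hierarchical design while keeping per-layer attention cheap.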
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Video Retrieval | DiDeMo (test) | R@1: 35.3 | 376 |
| Text-to-Video Retrieval | DiDeMo | R@1: 0.35 | 360 |
| Video Question Answering | ActivityNet-QA | Accuracy: 39.9 | 319 |
| Video Question Answering | ActivityNet-QA (test) | Accuracy: 39.9 | 275 |
| Text-to-Video Retrieval | MSRVTT (test) | Recall@1: 0.35 | 155 |
| Text-to-Video Retrieval | ActivityNet (test) | R@1: 35.3 | 108 |
| Video Question Answering | How2QA | Acc: 76.1 | 47 |
| Long-form Video Understanding | LVU | Relation Attribute Accuracy: 61.5 | 44 |
| Paragraph-to-Video Retrieval | ActivityNet-Captions (test) | R@1: 35.3 | 22 |
| Paragraph-to-Video Retrieval | QuerYD (test) | R@1: 69.7 | 12 |