HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
About
Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms, and thus do not fully exploit the characteristic that is unique to video: its temporal dimension. In this paper, we propose HiTeA, a Hierarchical Temporal-Aware video-language pre-training framework with two novel pre-training tasks that model cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task that explores moments in videos and yields detailed video moment representations. In addition, a multi-modal temporal relation exploration task captures the inherent temporal relations by aligning video-text pairs as a whole at different time resolutions. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label), with improvements of 8.6% and 11.1% respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
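The abstract names a shuffling test for measuring temporal reliance but does not spell out its procedure here. The following is a minimal sketch of one plausible reading, assuming the test compares a model's accuracy on videos with their original frame order against accuracy on the same videos with frames randomly permuted. The function names (`shuffle_frames`, `accuracy`, `shuffling_test`), the `(frames, text, label)` sample layout, and the `model(frames, text)` call signature are all hypothetical and are not taken from the paper or from any HiTeA release.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical sample layout: (frames, text query, ground-truth answer).
Sample = Tuple[Sequence, str, str]
# Hypothetical model interface: takes frames and text, returns an answer.
Model = Callable[[Sequence, str], str]


def shuffle_frames(frames: Sequence, rng: random.Random) -> List:
    """Return a copy of the frames in random order (input is not modified)."""
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def accuracy(model: Model, dataset: Sequence[Sample]) -> float:
    """Fraction of samples for which the model's answer matches the label."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(1 for frames, text, label in dataset
                  if model(frames, text) == label)
    return correct / len(dataset)


def shuffling_test(model: Model, dataset: Sequence[Sample],
                   seed: int = 0) -> Tuple[float, float, float]:
    """Assumed form of the test: accuracy with ordered vs. shuffled frames.

    Returns (original accuracy, shuffled accuracy, drop). A drop near zero
    suggests the model/dataset pair does not depend on frame order.
    """
    rng = random.Random(seed)
    shuffled_dataset = [(shuffle_frames(frames, rng), text, label)
                        for frames, text, label in dataset]
    original = accuracy(model, dataset)
    shuffled = accuracy(model, shuffled_dataset)
    return original, shuffled, original - shuffled
```

Under this reading, a large drop indicates that answering correctly requires temporal order, while a drop of zero indicates that a bag-of-frames view is sufficient for that dataset or model.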
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 74.06 | 664 |
| Video Question Answering | MSRVTT-QA | Accuracy | 45.9 | 481 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 74.28 | 466 |
| Text-to-Video Retrieval | DiDeMo (test) | R@1 | 56.5 | 376 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 45.9 | 371 |
| Text-to-Video Retrieval | DiDeMo | R@1 | 0.565 | 360 |
| Video Question Answering | MSVD-QA | Accuracy | 55.3 | 340 |
| Video Question Answering | ActivityNet-QA | Accuracy | 46.4 | 319 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 46.8 | 313 |
| Image-to-Text Retrieval | MS-COCO 5K (test) | R@1 | 72.4 | 299 |