HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
About
Video-text retrieval has become an active research topic as multimedia data on the internet grows. Transformers for video-text learning have attracted increasing attention because of their promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) they make limited use of the fact that different transformer layers capture different feature characteristics; 2) the end-to-end training mechanism restricts negative sample interactions to those within a mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs Hierarchical Cross-modal Contrastive Matching at both the feature level and the semantic level, achieving multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative sample interactions on the fly, which contributes to more precise and discriminative representations. Experimental results on three major video-text retrieval benchmark datasets demonstrate the advantages of our method.
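The momentum-contrast idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows the general MoCo-style ingredients (a key encoder updated as an exponential moving average of the query encoder, a fixed-size queue of past keys used as negatives, and an InfoNCE-style loss). The function names, the momentum `m=0.999`, the temperature `0.07`, and the queue size are illustrative assumptions, and encoder parameters and embeddings are represented as plain Python lists.

```python
import math
from collections import deque


def momentum_update(query_params, key_params, m=0.999):
    """MoCo-style EMA update: key <- m * key + (1 - m) * query.

    Both arguments are flat lists of floats standing in for encoder weights
    (a simplification made for this sketch).
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query, one positive key, and a list of negatives.

    Returns -log(softmax(logits)[0]), where index 0 is the positive pair.
    """
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, neg) / temperature for neg in negative_keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


# A fixed-size queue of past video embeddings serves as the negative pool for
# text queries; the oldest entries are dropped automatically when it is full.
# (Hypothetical size and values, chosen only for the demonstration.)
video_queue = deque(maxlen=4)
for past_video in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    video_queue.append(past_video)

text_query = [1.0, 0.0]    # embedding from the text query encoder
video_key = [1.0, 0.0]     # matching video embedding from the momentum key encoder
loss_t2v = info_nce(text_query, video_key, list(video_queue))

# After computing the loss, the current key joins the queue as a future negative.
video_queue.append(video_key)
```

In a cross-modal setting the same pattern would be applied in both directions (text queries against a video queue, and video queries against a text queue), so the number of negatives is set by the queue length rather than by the mini-batch size.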
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 30.7 | 313 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 | 30.7 | 234 |
| Text-to-Video Retrieval | LSMDC (test) | R@1 | 14 | 225 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10 | 73.2 | 211 |
| Text-to-Video Retrieval | ActivityNet | R@1 | 0.296 | 197 |
| Video-to-Text Retrieval | MSR-VTT | Recall@1 | 32.1 | 157 |
| Text-to-Video Retrieval | LSMDC | R@1 | 14 | 154 |
| Text-to-Video Retrieval | ActivityNet (test) | R@1 | 29.6 | 108 |
| Text-to-Video Retrieval | MSRVTT | R@1 | 30.7 | 75 |
| Video-to-Text Retrieval | MSR-VTT (1k-A) | Recall@5 | 62.7 | 74 |
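The R@K (Recall@K) metric reported above is the percentage of queries whose correct item appears among the top K retrieved candidates. A minimal sketch of how it can be computed from a similarity matrix follows. It assumes a one-to-one pairing in which query `i` matches candidate `i`, which may differ from the exact evaluation protocol of a given benchmark; the function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption for this sketch: the correct candidate for query i is index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: how many candidates score strictly higher
        # than the correct one.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy text-to-video similarity matrix: 3 text queries x 3 videos.
toy_scores = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked first
    [0.8, 0.3, 0.1],  # query 1: correct video ranked second
    [0.2, 0.1, 0.7],  # query 2: correct video ranked first
]
r_at_1 = recall_at_k(toy_scores, 1)
r_at_2 = recall_at_k(toy_scores, 2)
```

For video-to-text retrieval the same function can be applied to the transposed matrix, so that rows index videos and columns index captions.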