
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

About

Video-Text Retrieval has been a hot research topic with the growth of multimedia data on the internet. Transformers for video-text learning have attracted increasing attention due to their promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) exploitation of the transformer architecture, in which different layers have different feature characteristics, is limited; 2) the end-to-end training mechanism limits negative sample interactions to those within a mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs Hierarchical Cross-modal Contrastive Matching at both the feature level and the semantic level, achieving multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative sample interactions on-the-fly, which contributes to the generation of more precise and discriminative representations. Experimental results on the three major Video-Text Retrieval benchmark datasets demonstrate the advantages of our method.

Song Liu, Haoqi Fan, Shengsheng Qian, Yiru Chen, Wenkui Ding, Zhongyuan Wang • 2021
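
The abstract's MoCo-inspired idea (a slowly updated key encoder plus a queue of negatives from earlier mini-batches) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `momentum_update` and `info_nce`, the momentum value 0.999, the temperature 0.07, and the queue size are all illustrative assumptions, and real encoders are replaced here by plain lists of floats.

```python
import math
from collections import deque

def momentum_update(query_params, key_params, m=0.999):
    """EMA update in the style of MoCo: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against a zero vector
    return [x / norm for x in v]

def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Contrastive loss for one query, one positive key, and queued negatives."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(n)) / temperature for n in negative_keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]  # negative log-softmax of the positive pair

# Hypothetical setup: a text embedding is the query, video embeddings are keys.
# The fixed-size FIFO queue keeps keys from earlier mini-batches as negatives,
# so the number of negatives is not tied to the mini-batch size.
video_queue = deque(maxlen=4)
for old_video_key in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    video_queue.append(old_video_key)

text_query = [1.0, 0.1]
matching_video_key = [0.9, 0.0]
loss = info_nce(text_query, matching_video_key, list(video_queue))

# After each step the key encoder trails the query encoder via the EMA update.
key_encoder_params = momentum_update([0.5, -0.2], [0.4, -0.1])
video_queue.append(matching_video_key)  # enqueue the newest key
```

For cross-modal retrieval the same loss would presumably be applied in both directions (text queries against a video queue and video queries against a text queue); the sketch shows only one direction.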

Related benchmarks

Task                      Dataset             Result          Rank
Text-to-Video Retrieval   MSR-VTT             Recall@1 30.7   313
Text-to-Video Retrieval   MSR-VTT (test)      R@1 30.7        234
Text-to-Video Retrieval   LSMDC (test)        R@1 14          225
Text-to-Video Retrieval   MSR-VTT (1k-A)      R@10 73.2       211
Text-to-Video Retrieval   ActivityNet         R@1 0.296       197
Video-to-Text retrieval   MSR-VTT             Recall@1 32.1   157
Text-to-Video Retrieval   LSMDC               R@1 14          154
Text-to-Video Retrieval   ActivityNet (test)  R@1 29.6        108
Text-to-Video Retrieval   MSRVTT              R@1 30.7        75
Video-to-Text retrieval   MSR-VTT (1k-A)      Recall@5 62.7   74
Showing 10 of 20 rows
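
The R@K (Recall@K) figures above measure how often the correct item appears among the top K retrieved candidates. As a rough sketch of how such a number is typically computed, assuming a square similarity matrix in which the correct candidate for query i sits at index i (the function name `recall_at_k` and the toy scores are illustrative, not taken from the paper):

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth candidate ranks in the top k.

    Assumes similarity[i][j] scores query i against candidate j, and that
    the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: how many candidates score strictly higher
        # than the ground-truth candidate.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)

# Toy example: 3 text queries scored against 3 videos.
sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.2, 0.5],  # correct video ranked 3rd
    [0.2, 0.4, 0.3],  # correct video ranked 2nd
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
r3 = recall_at_k(sim, 3)
```

The function returns a percentage. Most rows in the table appear to use that scale, while the ActivityNet entry of 0.296 looks like the same kind of quantity reported as a fraction.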
