T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval

About

Text-video retrieval is a challenging task that aims to retrieve relevant video content from natural language descriptions. The key to this problem is measuring text-video similarity in a joint embedding space. However, most existing methods consider only the global cross-modal similarity and overlook local details. Some works incorporate local comparisons through cross-modal local matching and reasoning, but these complex operations are computationally expensive. In this paper, we design an efficient global-local alignment method. The multi-modal video sequences and text features are adaptively aggregated with a set of shared semantic centers. The local cross-modal similarities are computed between the video feature and the text feature within the same center. This design enables fine-grained local comparison and reduces the computational cost of the interaction between each text-video pair. Moreover, a global alignment method is proposed to provide a global cross-modal measurement that is complementary to the local perspective. The globally aggregated visual features also provide additional supervision, which is indispensable to the optimization of the learnable semantic centers. We achieve consistent improvements on three standard text-video retrieval benchmarks and outperform the state of the art by a clear margin.

Xiaohan Wang, Linchao Zhu, Yi Yang • 2021
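
The aggregation and alignment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a NetVLAD-style soft assignment with residual pooling, mean pooling for the global text feature, and a fixed weight `alpha` that mixes the local and global scores. None of these details are stated in the abstract, and every function name here is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v, eps=1e-12):
    norm = math.sqrt(dot(v, v))
    return [x / (norm + eps) for x in v]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(features, centers):
    """Pool a sequence of local features into one vector per shared center.

    Assumption: NetVLAD-style pooling, i.e. each feature is softly assigned
    to the centers and its residual (feature - center) is accumulated.
    """
    dim = len(centers[0])
    pooled = [[0.0] * dim for _ in centers]
    for f in features:
        weights = softmax([dot(f, c) for c in centers])
        for k, c in enumerate(centers):
            for d in range(dim):
                pooled[k][d] += weights[k] * (f[d] - c[d])
    return [l2_normalize(p) for p in pooled]


def local_similarity(video_feats, text_feats, centers):
    """Compare video and text only within the same center (K dot products)."""
    video_slots = aggregate(video_feats, centers)
    text_slots = aggregate(text_feats, centers)
    return sum(dot(v, t) for v, t in zip(video_slots, text_slots)) / len(centers)


def global_similarity(video_global, text_feats):
    """Cosine similarity of a global video vector and a mean-pooled text vector.

    Assumption: mean pooling stands in for the paper's global text feature.
    """
    text_global = [sum(col) / len(text_feats) for col in zip(*text_feats)]
    return dot(l2_normalize(video_global), l2_normalize(text_global))


def similarity(video_feats, video_global, text_feats, centers, alpha=0.5):
    """Mix local and global scores; alpha is an illustrative choice."""
    return (alpha * local_similarity(video_feats, text_feats, centers)
            + (1.0 - alpha) * global_similarity(video_global, text_feats))


# Toy 2-D example with two shared centers.
centers = [[1.0, 0.0], [0.0, 1.0]]
video_feats = [[0.9, 0.1], [0.2, 0.8]]
text_feats = [[0.8, 0.3], [0.1, 0.7], [0.5, 0.5]]
video_global = [0.55, 0.45]
score = similarity(video_feats, video_global, text_feats, centers)
```

Because both modalities are pooled against the same centers, each text-video pair needs only one dot product per center, which is the cost saving the abstract refers to. In the paper the centers are learned; here they are fixed by hand to keep the sketch short.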

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 29.5 | 369 |
| Text-to-Video Retrieval | ActivityNet | R@1 | 0.237 | 238 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10 | 70.1 | 211 |
| Video-to-Text Retrieval | MSR-VTT | Recall@1 | 31.8 | 185 |
| Text-to-Video Retrieval | LSMDC | R@1 | 14.3 | 167 |
| Video-to-Text Retrieval | ActivityNet | R@1 | 0.241 | 115 |
| Video-to-Text Retrieval | MSR-VTT (1k-A) | Recall@5 | 60 | 74 |
| Video-to-Text Retrieval | LSMDC | R@1 | 14.2 | 64 |
| Text-to-Video Retrieval | ActivityNet-captions (val1) | R@1 | 23.7 | 58 |
| Text-to-Video Retrieval | MSR-VTT 9K | R@1 | 29.5 | 55 |
Showing 10 of 20 rows
