
Cross-Modal and Hierarchical Modeling of Video and Text

About

Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, where each depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively convey a coherent message or story. In this paper, we investigate modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities. Specifically, we introduce hierarchical sequence embedding (HSE), a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, with either explicit or implicit correspondence information. We perform empirical studies on large-scale video and paragraph retrieval datasets and demonstrate superior performance by the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings when applied to downstream tasks. We show their utility in zero-shot action recognition and video captioning.

Bowen Zhang, Hexiang Hu, Fei Sha • 2018

Related benchmarks

Task                      Dataset              Result          Rank
Text-to-Video Retrieval   DiDeMo               R@1 0.302       459
Text-to-Video Retrieval   DiDeMo (test)        R@1 13.9        399
Text-to-Video Retrieval   MSR-VTT              Recall@1 32.9   369
Text-to-Video Retrieval   ActivityNet          R@1 44.4        238
Video-to-Text Retrieval   DiDeMo               R@1 30.1        130
Text-to-Video Retrieval   YouCook2             --              117
Video-to-Text Retrieval   ActivityNet          R@1 44.2        115
Video-to-Text Retrieval   DiDeMo (test)        R@1 13.1        111
Text-to-Video Retrieval   ActivityNet (test)   R@1 20.5        108
Video-to-Text Retrieval   ActivityNet (test)   R@1 18.7        63
Showing 10 of 26 rows
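
The R@1 and Recall@1 results listed above are recall at rank 1: the fraction of queries whose matching item is the top-ranked retrieval. The following is a minimal sketch of how such a metric is commonly computed, not the evaluation code used by the paper or by any of these benchmarks. It assumes one ground-truth video per text query, stored at the same index, and the function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the matching video for query i sits at index i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        target = row[i]
        # Zero-based rank: number of candidates scoring strictly higher
        # than the ground-truth match (ties are resolved optimistically).
        rank = sum(1 for score in row if score > target)
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix; the numbers are made up for illustration.
scores = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.2, 0.4, 0.8],  # query 1: correct video ranked second
    [0.5, 0.6, 0.7],  # query 2: correct video ranked first
]

r1 = recall_at_k(scores, 1)  # 2 of 3 queries hit at rank 1
r2 = recall_at_k(scores, 2)  # all 3 queries hit within the top 2
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

Under the same assumptions, passing the transposed matrix scores the video-to-text direction. Whether a result is reported as a fraction or as a percentage is a presentation choice, so values on different scales are not directly comparable without checking the source.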
