
Cross-Modal and Hierarchical Modeling of Video and Text

About

Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, each of which depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively convey a coherent message or story. In this paper, we investigate modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities. Specifically, we introduce hierarchical sequence embedding (HSE), a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, with either explicit or implicit correspondence information. We perform empirical studies on large-scale video and paragraph retrieval datasets and demonstrate the superior performance of the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings when applied to downstream tasks, and show their utility in zero-shot action recognition and video captioning.

Bowen Zhang, Hexiang Hu, Fei Sha • 2018

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Video Retrieval | DiDeMo (test) | R@1 13.9 | 376 |
| Text-to-Video Retrieval | DiDeMo | R@1 0.302 | 360 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 32.9 | 313 |
| Text-to-Video Retrieval | ActivityNet | R@1 44.4 | 197 |
| Text-to-Video Retrieval | YouCook2 | -- | 117 |
| Video-to-Text retrieval | DiDeMo | R@1 30.1 | 108 |
| Text-to-Video Retrieval | ActivityNet (test) | R@1 20.5 | 108 |
| Video-to-Text retrieval | ActivityNet | R@1 44.2 | 99 |
| Video-to-Text retrieval | DiDeMo (test) | R@1 13.1 | 92 |
| Video-to-Text retrieval | ActivityNet (test) | R@1 18.7 | 63 |
Showing 10 of 26 rows
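
For readers unfamiliar with the metric in the table above, Recall@K (R@K) in retrieval is commonly computed as the fraction of queries whose correct item appears among the top K ranked candidates. The sketch below is only an illustration of that common definition, not the evaluation code behind these results. The function name `recall_at_k`, the toy `scores` matrix, the assumption that the correct video for query i sits at index i, and the optimistic handling of ties are all hypothetical choices made for this example.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth item ranks in the top k.

    Assumes similarity[i][j] scores text query i against video j, and that
    the correct video for query i is at index i (a hypothetical convention).
    """
    if not similarity:
        raise ValueError("similarity matrix must not be empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: how many candidates score strictly higher than
        # the true match (ties are resolved in favor of the true match).
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy 3-query, 3-video similarity matrix (made-up numbers).
scores = [
    [0.9, 0.1, 0.3],  # query 0: true video ranked first
    [0.2, 0.4, 0.8],  # query 1: true video ranked second
    [0.5, 0.1, 0.7],  # query 2: true video ranked first
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
```

The function returns a fraction between 0 and 1. Leaderboards may report the same quantity either as a fraction or multiplied by 100 as a percentage.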
