Cross-Modal and Hierarchical Modeling of Video and Text

About

Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, where each depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively convey a coherent message or story. In this paper, we investigate modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities. Specifically, we introduce hierarchical sequence embedding (HSE), a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, with either explicit or implicit correspondence information. We perform empirical studies on large-scale video and paragraph retrieval datasets and demonstrate superior performance by the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings when applied to downstream tasks. We show their utility in zero-shot action recognition and video captioning.

Bowen Zhang, Hexiang Hu, Fei Sha • 2018

Related benchmarks

Task	Dataset	Result
Text-to-Video Retrieval	DiDeMo	R@1 0.302	472
Text-to-Video Retrieval	DiDeMo (test)	R@1 13.9	407
Text-to-Video Retrieval	MSR-VTT	Recall@1 32.9	406
Text-to-Video Retrieval	ActivityNet	R@1 44.4	255
Video-to-Text retrieval	ActivityNet	R@1 44.2	160
Video-to-Text retrieval	DiDeMo	R@1 30.1	136
Text-to-Video Retrieval	YouCook2	--	117
Video-to-Text retrieval	DiDeMo (test)	R@1 13.1	111
Text-to-Video Retrieval	ActivityNet (test)	R@1 20.5	108
Video-to-Text retrieval	ActivityNet (test)	R@1 18.7	63

Showing 10 of 26 rows
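
The results above are reported as recall at rank 1 (R@1). As a minimal sketch of how such a metric is commonly computed for retrieval, the Python below scores a query-by-candidate similarity matrix. The function name `recall_at_k`, the toy `scores` matrix, and the assumption that candidate i is the single correct match for query i are illustrative choices, not details taken from the paper or from this page. The function returns a fraction, so a value of 0.444 would correspond to 44.4 when expressed as a percentage.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption (hypothetical setup): candidate i is the single correct
    match for query i, as in paired text/video retrieval.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # 0-based rank of the correct candidate: the number of candidates
        # that score strictly higher (ties are resolved optimistically).
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy example: rows are text queries, columns are videos.
scores = [
    [0.9, 0.1, 0.3],  # correct video (index 0) is ranked first
    [0.2, 0.4, 0.8],  # correct video (index 1) is ranked second
    [0.1, 0.2, 0.7],  # correct video (index 2) is ranked first
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

Transposing the matrix swaps the direction of retrieval, so the same function can cover both the text-to-video and video-to-text rows under these assumptions.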
