
Relation-aware Video Reading Comprehension for Temporal Language Grounding

About

Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary-regression task or a span-extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. The framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match visual and textual information simultaneously at the sentence-moment and token-moment levels, yielding a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced that leverages graph convolution to capture dependencies among video moment choices for selecting the best choice. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of the solution. Code is available.
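To make the multi-choice relation constructor concrete, here is a minimal numpy sketch of one graph-convolution layer over candidate moment choices. The candidate set (all valid (start, end) clip pairs) and the adjacency rule (connecting choices that share a start or end boundary) are illustrative assumptions for this sketch, not necessarily the paper's exact construction.

```python
import numpy as np

# Candidate moments: all (s, e) clip pairs with s <= e over T clips.
T = 4   # number of video clips (assumed for illustration)
D = 8   # feature dimension per moment choice (assumed)

moments = [(s, e) for s in range(T) for e in range(s, T)]
N = len(moments)  # T*(T+1)/2 = 10 choices for T=4

# Adjacency among choices: self-loops, plus edges between moments that
# share a start or end boundary (an assumed relation rule).
A = np.eye(N)
for i, (s1, e1) in enumerate(moments):
    for j, (s2, e2) in enumerate(moments):
        if i != j and (s1 == s2 or e1 == e2):
            A[i, j] = 1.0

# Symmetric normalization D^{-1/2} A D^{-1/2}, standard for GCN layers.
deg = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(deg, deg))

rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))   # per-choice fused features (stand-in)
W = rng.standard_normal((D, D))   # learnable layer weights (stand-in)

# One graph-convolution layer: H = ReLU(A_norm X W), letting each choice
# aggregate evidence from related choices before scoring.
H = np.maximum(A_norm @ X @ W, 0.0)
print(H.shape)  # (10, 8)
```

After such a layer, each moment choice's representation reflects its related candidates, and a scoring head can rank the choices against the query.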

Jialin Gao, Xin Sun, Mengmeng Xu, Xi Zhou, Bernard Ghanem • 2021

Related benchmarks

| Task                   | Dataset              | Metric                  | Result | Rank |
|------------------------|----------------------|-------------------------|--------|------|
| Moment Retrieval       | Charades-STA (test)  | R@0.5                   | 42.91  | 172  |
| Video Grounding        | Charades-STA         | R@1 (IoU=0.5)           | 60.4   | 113  |
| Video Moment Retrieval | Charades-STA (test)  | Recall@1 (IoU=0.5)      | 39.65  | 77   |
| Video Moment Retrieval | TACoS (test)         | Recall@1 (0.5 threshold)| 33.54  | 70   |
| Video Grounding        | TACoS                | Recall@1 (IoU=0.5)      | 33.54  | 45   |
| Video Grounding        | ActivityNet Captions | R@1 (IoU=0.5)           | 45.59  | 43   |
