Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video

About

In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video. Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence, with no reliance on any temporal annotation during training. We propose a two-stage model to tackle this problem in a coarse-to-fine manner. In the coarse stage, we first generate a set of fixed-length temporal proposals using multi-scale sliding windows, and match their visual features against the sentence features to identify the best-matched proposal as a coarse grounding result. In the fine stage, we perform a fine-grained matching between the visual features of the frames in the best-matched proposal and the sentence features to locate the precise frame boundary of the fine grounding result. Comprehensive experiments on the ActivityNet Captions dataset and the Charades-STA dataset demonstrate that our two-stage model achieves compelling performance.
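The coarse-to-fine procedure described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the mean-pooling of frame features, the cosine-similarity matching, the 50% window overlap, and the thresholding rule in the fine stage are all assumptions chosen to keep the example short. The abstract only states that proposals come from multi-scale sliding windows and that the fine stage matches individual frames against the sentence.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Segment = Tuple[int, int]  # [start, end) in frame indices


def sliding_window_proposals(num_frames: int, window_sizes: List[int],
                             stride_ratio: float = 0.5) -> List[Segment]:
    """Fixed-length proposals at several scales (stride_ratio is an assumption)."""
    proposals = []
    for w in window_sizes:
        if w > num_frames:
            continue
        stride = max(1, int(w * stride_ratio))
        for start in range(0, num_frames - w + 1, stride):
            proposals.append((start, start + w))
    return proposals


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def mean_pool(frames: List[Vector]) -> List[float]:
    return [sum(col) / len(frames) for col in zip(*frames)]


def coarse_ground(frame_feats: List[Vector], sent_feat: Vector,
                  window_sizes: List[int]) -> Segment:
    """Coarse stage: pick the proposal whose pooled feature best matches the sentence."""
    proposals = sliding_window_proposals(len(frame_feats), window_sizes)
    return max(proposals,
               key=lambda p: cosine(mean_pool(frame_feats[p[0]:p[1]]), sent_feat))


def fine_ground(frame_feats: List[Vector], sent_feat: Vector,
                proposal: Segment, threshold: float = 0.5) -> Segment:
    """Fine stage (hypothetical rule): shrink the proposal to the span of frames
    whose own similarity to the sentence reaches the threshold."""
    start, end = proposal
    keep = [i for i in range(start, end)
            if cosine(frame_feats[i], sent_feat) >= threshold]
    return (keep[0], keep[-1] + 1) if keep else proposal


# Toy 2-D features: frames 3..5 match the sentence, the rest do not.
frames = [[1, 0]] * 3 + [[0, 1]] * 3 + [[1, 0]] * 2
sentence = [0, 1]
coarse = coarse_ground(frames, sentence, window_sizes=[4, 8])
fine = fine_ground(frames, sentence, coarse)
```

In this toy input the coarse stage can only return one of the fixed windows, so the fine stage is what moves the boundary onto the frames that actually match the sentence.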

Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, Kwan-Yee K. Wong • 2020

Related benchmarks

Task	Dataset	Result
Temporal Grounding	Charades-STA (test)	Recall@1 (IoU=0.5): 27.3	68
Video Grounding	ActivityNet Captions 1.3 (test val)	R@1 (IoU=0.5): 23.6	21
Temporal Sentence Grounding	Charades-STA (test)	IoU@0.5: 27.3	16
Temporal Sentence Grounding	ActivityNet Captions v1.3 (test)	Recall (IoU=0.3): 44.3	16
Video Moment Retrieval	Charades	Rank@1 (IoU=0.3): 39.8	16
Video Moment Retrieval	ActivityNet	Rank@1 (IoU=0.3): 44.3	15
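
The results above are recall-style metrics at a temporal IoU threshold. A minimal sketch of how such a number is commonly computed follows, assuming one top-ranked predicted segment per query, segments given as (start, end) in seconds, and a hit counted when IoU is at least the threshold. The function names are illustrative, and the exact evaluation protocol behind each benchmark row is not stated here.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Segment], gts: List[Segment],
                iou_threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


# Two queries: the first prediction is exact, the second overlaps by one third.
preds = [(0.0, 10.0), (0.0, 10.0)]
gts = [(0.0, 10.0), (5.0, 15.0)]
score_05 = recall_at_1(preds, gts, iou_threshold=0.5)
score_03 = recall_at_1(preds, gts, iou_threshold=0.3)
```

Lowering the threshold from 0.5 to 0.3 lets the partially overlapping second prediction count as a hit, which is why scores at IoU=0.3 are higher than scores at IoU=0.5 for the same model.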

