# Knowing Where to Focus: Event-aware Transformer for Video Grounding

## About
Recent DETR-based video grounding models directly predict moment timestamps without hand-crafted components, such as pre-defined proposals or non-maximum suppression, by learning moment queries. However, their input-agnostic moment queries inevitably overlook the intrinsic temporal structure of a video and provide only limited positional information. In this paper, we formulate an event-aware dynamic moment query that enables the model to take the input-specific content and positional information of the video into account. To this end, we present two levels of reasoning: 1) event reasoning, which captures the distinctive event units constituting a given video using a slot attention mechanism; and 2) moment reasoning, which fuses the moment queries with a given sentence through a gated fusion transformer layer and learns interactions between the moment queries and video-sentence representations to predict moment timestamps. Extensive experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, outperforming state-of-the-art approaches on several video grounding benchmarks.
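The two reasoning levels described above can be illustrated with a toy sketch. This is not the paper's implementation: the actual model uses trained transformer layers and learned parameters, while `slot_attention` and `gated_fusion` below are simplified NumPy stand-ins with random weights (GRU/MLP slot updates and multi-head attention are omitted). The sketch only shows the two characteristic patterns: attention normalized over the slot axis so that slots compete for video features, and a sigmoid gate that controls how much sentence information flows into each moment query.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Toy slot attention: slots compete for input features via attention
    normalized over the slot axis, then each slot is updated with the
    attention-weighted mean of the features."""
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    slots = rng.normal(size=(num_slots, d))          # random initial slots
    for _ in range(iters):
        logits = slots @ inputs.T / np.sqrt(d)        # (num_slots, n)
        attn = softmax(logits, axis=0)                # slots compete per feature
        attn = attn / attn.sum(axis=1, keepdims=True) # weighted-mean weights
        slots = attn @ inputs                         # update slot vectors
    return slots

def gated_fusion(queries, sentence, rng):
    """Toy gated fusion: a per-dimension sigmoid gate mixes each moment
    query with the sentence representation (hypothetical random weights)."""
    k, d = queries.shape
    w = rng.normal(scale=0.1, size=(2 * d, d))        # hypothetical gate weights
    pair = np.concatenate(
        [queries, np.broadcast_to(sentence, queries.shape)], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(pair @ w)))          # sigmoid gate in (0, 1)
    return gate * queries + (1.0 - gate) * sentence
```

In the full model, the event slots produced in the first stage would serve as the input-specific positional priors for the moment queries fused in the second stage.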
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Moment Retrieval | QVHighlights (test) | R@1 (IoU=0.5) | 61.36 | 170 |
| Highlight Detection | QVHighlights (test) | HIT@1 | 58.65 | 151 |
| Video Grounding | Charades-STA | R@1 IoU=0.5 | 68.4 | 113 |
| Video Moment Retrieval | Charades-STA (test) | Recall@1 (IoU=0.5) | 68.47 | 77 |
| Video Grounding | QVHighlights (test) | mAP (IoU=0.5) | 59.95 | 64 |
| Moment Retrieval | QVHighlights (val) | R@1 (IoU=0.5) | 61.4 | 53 |
| Temporal Grounding | ActivityNet Captions | Recall@1 (IoU=0.5) | 58.2 | 45 |
| Video Moment Retrieval | Charades-STA | R1@0.5 | 68.47 | 44 |
| Highlight Detection | QVHighlights (val) | HIT@1 | 58.7 | 35 |
| Temporal Grounding | Charades-STA | -- | -- | 33 |