# Knowing Where to Focus: Event-aware Transformer for Video Grounding

## About
Recent DETR-based video grounding models directly predict moment timestamps without hand-crafted components, such as pre-defined proposals or non-maximum suppression, by learning moment queries. However, their input-agnostic moment queries inevitably overlook the intrinsic temporal structure of a video and provide only limited positional information. In this paper, we formulate an event-aware dynamic moment query that enables the model to take the input-specific content and positional information of the video into account. To this end, we present two levels of reasoning: 1) event reasoning, which captures the distinctive event units constituting a given video using a slot attention mechanism; and 2) moment reasoning, which fuses the moment queries with a given sentence through a gated fusion transformer layer and learns interactions between the moment queries and video-sentence representations to predict moment timestamps. Extensive experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, outperforming state-of-the-art approaches on several video grounding benchmarks.
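The two reasoning levels described above can be illustrated with a toy sketch. This is not the paper's implementation: the actual model uses trained transformer layers and learned parameters, while `slot_attention` and `gated_fusion` below are simplified NumPy stand-ins with random weights (GRU/MLP slot updates and multi-head attention are omitted). The sketch only shows the two characteristic patterns: attention normalized over the slot axis so that slots compete for video features, and a sigmoid gate that controls how much sentence information flows into each moment query.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Toy slot attention: slots compete for input features via attention
    normalized over the slot axis, then each slot is updated with the
    attention-weighted mean of the features."""
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    slots = rng.normal(size=(num_slots, d))          # random initial slots
    for _ in range(iters):
        logits = slots @ inputs.T / np.sqrt(d)        # (num_slots, n)
        attn = softmax(logits, axis=0)                # slots compete per feature
        attn = attn / attn.sum(axis=1, keepdims=True) # weighted-mean weights
        slots = attn @ inputs                         # update slot vectors
    return slots

def gated_fusion(queries, sentence, rng):
    """Toy gated fusion: a per-dimension sigmoid gate mixes each moment
    query with the sentence representation (hypothetical random weights)."""
    k, d = queries.shape
    w = rng.normal(scale=0.1, size=(2 * d, d))        # hypothetical gate weights
    pair = np.concatenate(
        [queries, np.broadcast_to(sentence, queries.shape)], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(pair @ w)))          # sigmoid gate in (0, 1)
    return gate * queries + (1.0 - gate) * sentence
```

In the full model, the event slots produced in the first stage would serve as the input-specific positional priors for the moment queries fused in the second stage.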
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Moment Retrieval | QVHighlights (test) | R@1 (IoU=0.5) | 61.36 | 170 |
| Highlight Detection | QVHighlights (test) | HIT@1 | 58.65 | 151 |
| Video Grounding | Charades-STA | R@1 IoU=0.5 | 68.4 | 113 |
| Video Moment Retrieval | Charades-STA (test) | Recall@1 (IoU=0.5) | 68.47 | 77 |
| Video Grounding | QVHighlights (test) | mAP (IoU=0.5) | 59.95 | 64 |
| Moment Retrieval | QVHighlights (val) | R@1 (IoU=0.5) | 61.4 | 53 |
| Temporal Grounding | ActivityNet Captions | Recall@1 (IoU=0.5) | 58.2 | 45 |
| Video Moment Retrieval | Charades-STA | R1@0.5 | 68.47 | 44 |
| Highlight Detection | QVHighlights (val) | HIT@1 | 58.7 | 35 |
| Temporal Grounding | Charades-STA | -- | -- | 33 |