RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

About

Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting grounding methods designed for short videos (5-30 seconds) to this problem yields poor performance. Since most real-life videos, such as those on YouTube and in AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module's fine-grained event understanding, which is crucial for detecting specific moments. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple levels of granularity, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to closely mimic the long-video paradigm during training. RGNet surpasses prior methods, achieving state-of-the-art performance on the long video temporal grounding (LVTG) datasets MAD and Ego4D.

Tanveer Hannan, Md Mohaiminul Islam, Thomas Seidl, Gedas Bertasius • 2023

Related benchmarks

Task	Dataset	Metric	Value	
Video Grounding	MAD (test)	Recall@1 (IoU=0.1)	12.43	35
Video Grounding	Ego4D-NLQ v1 (test)	Recall@1 (Avg)	24.96	27
Video Grounding	MAD 1.0 (test)	R@1 (IoU=0.3)	9.48	26
Temporal Grounding	Ego4D NLQ (test)	R@1 (IoU=0.3)	20.63	20
Video Temporal Grounding	Ego4D NLQ v1 (val)	R@1 (IoU=0.3)	18.28	12
Temporal Grounding	Ego4D Goalstep (test)	R@1 (Th=0.3)	21.26	11
Long Video Moment Retrieval	MAD (test)	Recall@1 (Tol 0.1)	12.4	10
Long Video Temporal Grounding	Ego4D	Average Recall	21.81	9
Video Temporal Grounding	Ego4D NLQ (test)	Recall@5 (IoU=0.3)	34.02	8
Video Temporal Grounding	MAD v2 (test)	Recall@1 (IoU=0.3)	13.02	4
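
Most rows above report Recall@K at a temporal IoU threshold. As a rough illustration of how such a metric is commonly computed, here is a minimal sketch. It assumes one ground-truth moment per query and predictions already sorted by confidence; the names `temporal_iou` and `recall_at_k` are hypothetical and are not taken from the RGNet code or from any benchmark's official evaluation script.

```python
from typing import Sequence, Tuple

# A moment is a (start_sec, end_sec) pair. This representation is an assumption.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_preds: Sequence[Sequence[Span]],
    gts: Sequence[Span],
    k: int,
    iou_threshold: float,
) -> float:
    """Percentage of queries whose top-k predictions contain at least one
    moment with IoU >= iou_threshold against the ground truth."""
    if not gts:
        return 0.0
    hits = sum(
        any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k])
        for preds, gt in zip(ranked_preds, gts)
    )
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    # Two toy queries with made-up predictions, for illustration only.
    preds = [[(0.0, 10.0), (5.0, 15.0)], [(100.0, 110.0)]]
    gts = [(5.0, 15.0), (0.0, 10.0)]
    print(recall_at_k(preds, gts, k=1, iou_threshold=0.3))
```

Under this definition, a looser threshold (IoU=0.1) or a larger K can only raise the score, so values in rows with different settings are not directly comparable. Official scripts may differ in details such as averaging over several thresholds.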
