RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
About
Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short-video (5-30 seconds) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and in AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module's fine-grained event understanding, which is crucial for detecting specific moments. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple levels of granularity, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to closely mimic the long-video paradigm during training. RGNet surpasses prior methods, achieving state-of-the-art performance on the long video temporal grounding (LVTG) datasets MAD and Ego4D.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Grounding | MAD (test) | Recall@1 (IoU=0.1) | 12.43 | 35 |
| Video Grounding | Ego4D-NLQ v1 (test) | Recall@1 (IoU=0.3) | 20.63 | 21 |
| Temporal Grounding | Ego4D NLQ (test) | R@1 (IoU=0.3) | 20.63 | 20 |
| Video Grounding | MAD 1.0 (test) | R@1 (IoU=0.1) | 12.43 | 17 |
| Temporal Grounding | Ego4D Goalstep (test) | R@1 (Th=0.3) | 21.26 | 11 |
| Long Video Moment Retrieval | MAD (test) | Recall@1 (Tol 0.1) | 12.4 | 10 |
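
For readers unfamiliar with the metric, the sketch below shows how Recall@1 at a temporal IoU threshold is commonly computed: a query counts as a hit when the top-ranked predicted span overlaps the ground-truth span with IoU at or above the threshold. This is a minimal illustration of the standard definition, not the evaluation code of any benchmark listed here; the function names and the `(start, end)` span representation in seconds are assumptions made for the example, and individual benchmarks may apply protocol details not shown.

```python
from typing import Sequence, Tuple

# Hypothetical representation: a moment is a (start_sec, end_sec) pair.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(
    top1_preds: Sequence[Span], gts: Sequence[Span], iou_threshold: float
) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one top-1 prediction per ground truth")
    if not gts:
        return 0.0
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    preds = [(0.0, 10.0), (0.0, 10.0)]
    truths = [(5.0, 15.0), (100.0, 110.0)]
    print(recall_at_1(preds, truths, iou_threshold=0.3))
```

In the toy data, the first prediction overlaps its ground truth with IoU 1/3 and the second does not overlap at all, so one of two queries counts as a hit at a threshold of 0.3.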