SnAG: Scalable and Accurate Video Grounding
About
Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability -- they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state-of-the-art method for long-form video grounding, on the challenging MAD dataset, while achieving highly competitive results on short videos.
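The cost argument for late fusion can be illustrated with a toy sketch. This is not SnAG's actual architecture: `heavy_encoder`, the feature dimensions, and the dot-product fusion are all invented for illustration. The point is only that early fusion must re-run the expensive video encoder once per query, while late fusion runs it once per video and fuses each query with a cheap operation afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
T, Q, D = 512, 8, 64                       # clips per video, #queries, feature dim

video_feats = rng.standard_normal((T, D))  # pre-extracted clip features (toy)
query_feats = rng.standard_normal((Q, D))  # one pooled embedding per query (toy)

def heavy_encoder(x):
    """Stand-in for an expensive encoder pass (e.g., a transformer backbone)."""
    heavy_encoder.calls += 1
    return np.tanh(x)
heavy_encoder.calls = 0

# Early fusion: the query is injected BEFORE encoding, so the heavy
# encoder must run once for every (video, query) pair.
early_scores = []
for q in query_feats:
    fused_in = video_feats + q             # query-conditioned input (toy fusion)
    enc = heavy_encoder(fused_in)          # heavy pass, repeated Q times
    early_scores.append(enc @ q)           # per-clip relevance scores
early_calls = heavy_encoder.calls          # = Q heavy passes

# Late fusion: encode the video ONCE, then fuse each query cheaply.
heavy_encoder.calls = 0
enc_video = heavy_encoder(video_feats)     # single heavy pass per video
late_scores = [enc_video @ q for q in query_feats]  # cheap per-query fusion
late_calls = heavy_encoder.calls           # = 1 heavy pass
```

With hundreds of queries per long video (as in MAD), the gap between `Q` heavy passes and a single one dominates inference cost, which is the scalability argument the abstract makes for late fusion.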
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Grounding | Charades-STA | R@1 (IoU=0.5) | 65.13 | 113 |
| Video Grounding | TACOS | Recall@1 (IoU=0.5) | 44.86 | 45 |
| Video Moment Retrieval | Charades-STA | R1@0.5 | 62.9 | 44 |
| Video Grounding | ActivityNet Captions | R@1 (IoU=0.5) | 48.55 | 43 |
| Video Grounding | MAD (test) | Recall@1 (IoU=0.1) | 10.3 | 35 |
| Moment Retrieval | TACOS (test) | Recall@1 (IoU=0.5) | 44.86 | 23 |
| Video Grounding | Ego4D-NLQ v1 (test) | Recall@1 (IoU=0.3) | 15.87 | 21 |
| Temporal Grounding | Ego4D NLQ (test) | R@1 (IoU=0.3) | 15.87 | 20 |
| Video Event Grounding | ActivityNet | Recall@0.5 | 48.6 | 17 |
| Temporal Grounding | Ego4D-NLQ | R@1 (IoU=0.3) | 15.72 | 14 |