
SnAG: Scalable and Accurate Video Grounding

About

Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability: they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state-of-the-art method for long-form video grounding on the challenging MAD dataset, while achieving highly competitive results on short videos.
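The scalability argument in the abstract hinges on where fusion happens: with late fusion, the expensive video encoding is independent of the text, so it is computed once per video and amortized over all queries, while only a lightweight fusion head runs per query. The minimal PyTorch sketch below illustrates this structure under assumed names (LateFusionGrounder, encode paths, and the linear stand-ins for the backbones are all hypothetical, not SnAG's actual architecture or API):

```python
import torch
import torch.nn as nn

class LateFusionGrounder(nn.Module):
    """Illustrative late-fusion layout: video features are encoded once,
    independently of the text, then a cheap head fuses each query."""

    def __init__(self, dim=64):
        super().__init__()
        self.video_encoder = nn.Linear(dim, dim)   # stand-in for a video backbone
        self.query_encoder = nn.Linear(dim, dim)   # stand-in for a text encoder
        self.fusion_head = nn.Linear(2 * dim, 1)   # lightweight per-clip fusion

    def forward(self, video_feats, query_feats):
        # video_feats: (T, dim) clip features; query_feats: (Q, dim) query embeddings
        v = self.video_encoder(video_feats)        # computed ONCE per video
        scores = []
        for q in self.query_encoder(query_feats):  # cheap pass per text query
            fused = torch.cat([v, q.expand_as(v)], dim=-1)
            scores.append(self.fusion_head(fused).squeeze(-1))  # (T,) clip scores
        return torch.stack(scores)                 # (Q, T) grounding scores

model = LateFusionGrounder()
video = torch.randn(128, 64)   # 128 clips from one (possibly long) video
queries = torch.randn(5, 64)   # 5 text queries against the same video
out = model(video, queries)
print(out.shape)  # torch.Size([5, 128])
```

In an early-fusion design, by contrast, the video backbone would take the query as input and the (T, dim) encoding would have to be recomputed for every one of the Q queries, which is what breaks down on long videos with hundreds of queries.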

Fangzhou Mu, Sicheng Mo, Yin Li • 2024

Related benchmarks

Task                    Dataset                 Metric                Result   Rank
Video Grounding         Charades-STA            R@1 (IoU=0.5)         65.13    113
Video Grounding         TACoS                   Recall@1 (IoU=0.5)    44.86    45
Video Moment Retrieval  Charades-STA            R1@0.5                62.9     44
Video Grounding         ActivityNet Captions    R@1 (IoU=0.5)         48.55    43
Video Grounding         MAD (test)              Recall@1 (IoU=0.1)    10.3     35
Moment Retrieval        TACoS (test)            Recall@1 (IoU=0.5)    44.86    23
Video Grounding         Ego4D-NLQ v1 (test)     Recall@1 (IoU=0.3)    15.87    21
Temporal Grounding      Ego4D NLQ (test)        R@1 (IoU=0.3)         15.87    20
Video Event Grounding   ActivityNet             Recall@0.5            48.6     17
Temporal Grounding      Ego4D-NLQ               R@1 (IoU=0.3)         15.72    14
(Showing 10 of 15 rows.)

Other info

Code
