SnAG: Scalable and Accurate Video Grounding
About
Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability -- they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state-of-the-art method for long-form video grounding, on the challenging MAD dataset, while achieving highly competitive results on short videos.
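The cost argument for late fusion can be illustrated with a toy sketch. This is not SnAG's actual architecture: `heavy_encoder`, the feature dimensions, and the dot-product fusion are all invented for illustration. The point is only that early fusion must re-run the expensive video encoder once per query, while late fusion runs it once per video and fuses each query with a cheap operation afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
T, Q, D = 512, 8, 64                       # clips per video, #queries, feature dim

video_feats = rng.standard_normal((T, D))  # pre-extracted clip features (toy)
query_feats = rng.standard_normal((Q, D))  # one pooled embedding per query (toy)

def heavy_encoder(x):
    """Stand-in for an expensive encoder pass (e.g., a transformer backbone)."""
    heavy_encoder.calls += 1
    return np.tanh(x)
heavy_encoder.calls = 0

# Early fusion: the query is injected BEFORE encoding, so the heavy
# encoder must run once for every (video, query) pair.
early_scores = []
for q in query_feats:
    fused_in = video_feats + q             # query-conditioned input (toy fusion)
    enc = heavy_encoder(fused_in)          # heavy pass, repeated Q times
    early_scores.append(enc @ q)           # per-clip relevance scores
early_calls = heavy_encoder.calls          # = Q heavy passes

# Late fusion: encode the video ONCE, then fuse each query cheaply.
heavy_encoder.calls = 0
enc_video = heavy_encoder(video_feats)     # single heavy pass per video
late_scores = [enc_video @ q for q in query_feats]  # cheap per-query fusion
late_calls = heavy_encoder.calls           # = 1 heavy pass
```

With hundreds of queries per long video (as in MAD), the gap between `Q` heavy passes and a single one dominates inference cost, which is the scalability argument the abstract makes for late fusion.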
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Grounding | Charades-STA | R@1 (IoU=0.5) | 65.13 | 113 |
| Video Grounding | TACOS | Recall@1 (IoU=0.5) | 44.86 | 45 |
| Video Moment Retrieval | Charades-STA | R1@0.5 | 62.9 | 44 |
| Video Grounding | ActivityNet Captions | R@1 (IoU=0.5) | 48.55 | 43 |
| Video Grounding | MAD (test) | Recall@1 (IoU=0.1) | 10.3 | 35 |
| Moment Retrieval | TACOS (test) | Recall@1 (IoU=0.5) | 44.86 | 23 |
| Video Grounding | Ego4D-NLQ v1 (test) | Recall@1 (IoU=0.3) | 15.87 | 21 |
| Temporal Grounding | Ego4D NLQ (test) | R@1 (IoU=0.3) | 15.87 | 20 |
| Video Event Grounding | ActivityNet | Recall@0.5 | 48.6 | 17 |
| Temporal Grounding | Ego4D-NLQ | R@1 (IoU=0.3) | 15.72 | 14 |