SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

About

A truly capable AI system must do more than detect objects or recognize activities in isolation. It must form unified, grounded representations of who is acting, what they are doing, and when and where these actions unfold. These representations provide the perceptual bedrock for high-level reasoning, planning, and embodied interaction in the real world. Building such agents is central to long-horizon goals in embodied AI and robotics. Current video benchmarks evaluate fragments of these capabilities in isolation. They focus on either spatial grounding, object tracking, or temporal localization. As a result, they cannot rigorously measure progress on their joint, multi-instance integration. We introduce Spatio-temporal Video Action Grounding (SVAG), a task and benchmark that explicitly targets this unified competence by requiring models to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query in complex, multi-actor scenes. To support this task, we construct SVAG-Bench. It comprises 688 videos, 19,590 verified annotations, and 903 unique action verbs drawn from crowded urban environments, wildlife, and traffic surveillance. Each video has on average 28.5 action-centric queries. This yields the densest annotation among comparable video grounding benchmarks and enables fine-grained evaluation of multi-actor disambiguation, temporal overlap, and action compositionality. Annotations are produced by a pipeline that combines expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification to ensure both linguistic diversity and correctness. We further release SVAGEval, a standardized multi-referent evaluation toolkit. We also introduce SVAGFormer, a strong modular baseline architecture for SVAG.

Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl • 2025

Related benchmarks

Task	Dataset	Result
Multiple Object Tracking	MOT20 (test)	--	458
Temporal Grounding	OVIS (test)	R1@0.5 33.37	27
Temporal Grounding	MOT17 (test)	R1@0.5 6.41	27
Temporal Grounding	MOT20 (test)	R1@0.5 4.17	14
Spatial Grounding	OVIS (test)	HOTA 22.73	12
Spatial Grounding	MOT17 (test)	HOTA 0.6	12
Spatial Grounding	MOT20 (test)	HOTA 0.43	12
Spatial Grounding	MOT17 (test)	HOTA 0.597	12
Spatio-Temporal Grounding	OVIS, MOT17, and MOT20 (test)	m-HIoU 13.52	9
Spatio-Temporal Video Action Grounding	SVAGEval (test)	m-HIoU 14.148	4
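
The temporal grounding rows report R1@0.5. As a rough illustration, the sketch below computes Recall@1 at a temporal IoU threshold of 0.5 under the common single-segment definition: a query counts as a hit when its top-ranked predicted segment overlaps the ground-truth segment with IoU of at least 0.5. This is an assumption about the metric, not the SVAGEval implementation, which may handle multi-referent queries differently. The function names and the toy segments are hypothetical.

```python
from typing import List, Tuple

# A temporal segment as (start, end), in seconds or frame indices.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Segment], gts: List[Segment],
                threshold: float = 0.5) -> float:
    """Percentage of queries whose top-1 segment reaches the IoU threshold.

    Assumes one ground-truth segment per query (a simplification).
    """
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one prediction per ground-truth segment")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Toy example with made-up segments (not taken from the benchmark).
preds = [(2.0, 8.0), (0.0, 4.0)]
gts = [(0.0, 10.0), (3.0, 10.0)]
print(recall_at_1(preds, gts))  # first query IoU 0.6 (hit), second 0.1 (miss)
```

Scores in the table are percentages, so the sketch scales the hit rate by 100.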

