SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

About

A truly capable AI system must do more than detect objects or recognize activities in isolation. It must form unified, grounded representations of who is acting, what they are doing, and when and where these actions unfold. These representations provide the perceptual bedrock for high-level reasoning, planning, and embodied interaction in the real world. Building such agents is central to long-horizon goals in embodied AI and robotics. Current video benchmarks evaluate fragments of these capabilities in isolation. They focus on either spatial grounding, object tracking, or temporal localization. As a result, they cannot rigorously measure progress on their joint, multi-instance integration. We introduce Spatio-temporal Video Action Grounding (SVAG), a task and benchmark that explicitly targets this unified competence by requiring models to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query in complex, multi-actor scenes. To support this task, we construct SVAG-Bench. It comprises 688 videos, 19,590 verified annotations, and 903 unique action verbs drawn from crowded urban environments, wildlife, and traffic surveillance. Each video has on average 28.5 action-centric queries. This yields the densest annotation among comparable video grounding benchmarks and enables fine-grained evaluation of multi-actor disambiguation, temporal overlap, and action compositionality. Annotations are produced by a pipeline that combines expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification to ensure both linguistic diversity and correctness. We further release SVAGEval, a standardized multi-referent evaluation toolkit. We also introduce SVAGFormer, a strong modular baseline architecture for SVAG.

Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Multiple Object Tracking | MOT20 (test) | -- | -- | 458
Temporal Grounding | OVIS (test) | R1@0.5 | 33.37 | 27
Temporal Grounding | MOT17 (test) | R1@0.5 | 6.41 | 27
Temporal Grounding | MOT20 (test) | R1@0.5 | 4.17 | 14
Spatial Grounding | OVIS (test) | HOTA | 22.73 | 12
Spatial Grounding | MOT 17 (test) | HOTA | 0.6 | 12
Spatial Grounding | MOT20 (test) | HOTA | 0.43 | 12
Spatial Grounding | MOT17 (test) | HOTA | 0.597 | 12
Spatio-Temporal Grounding | OVIS, MOT17, and MOT20 (test) | m-HIoU | 13.52 | 9
Spatio-Temporal Video Action Grounding | SVAGEval (test) | m-HIoU | 14.148 | 4
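
The temporal grounding rows above report R1@0.5. As a minimal sketch, assuming this follows the common convention (a query counts as a hit when its top-ranked predicted segment has a temporal IoU of at least 0.5 with the ground-truth segment), the metric could be computed as below. The function names and the (start, end) segment format are illustrative assumptions and are not taken from SVAGEval.

```python
from typing import List, Tuple

# Assumed segment format: (start, end) in seconds.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Segment], gts: List[Segment], thr: float = 0.5) -> float:
    """Percentage of queries whose top-1 prediction reaches IoU >= thr."""
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= thr for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)
```

Under this reading, a value such as 33.37 would mean that roughly one in three queries has a top-1 segment overlapping the ground truth by at least half. The multi-referent setting of SVAG may aggregate per-instance scores differently, so treat this as the single-segment case only.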