Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

About

Spatio-temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression. Existing approaches mainly treat this complicated task as a parallel frame-grounding problem and thus suffer from two types of inconsistency: feature alignment inconsistency and prediction inconsistency. In this paper, we present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT), to alleviate these issues. Specifically, we introduce a novel multi-modal template as the global objective to address this task, which explicitly constrains the grounding region and associates the predictions among all video frames. Moreover, to generate the above template under sufficient video-textual perception, an encoder-decoder architecture is proposed for effective global context modeling. Thanks to these critical designs, STCAT enjoys more consistent cross-modal feature alignment and tube prediction without reliance on any pre-trained object detectors. Extensive experiments show that our method outperforms previous state-of-the-art methods by clear margins on two challenging video benchmarks (VidSTG and HC-STVG), illustrating the superiority of the proposed framework in better understanding the association between vision and natural language. Code is publicly available at https://github.com/jy0205/STCAT.

Yang Jin, Yongzhi Li, Zehuan Yuan, Yadong Mu • 2022

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Spatio-Temporal Video Grounding | VidSTG Interrogative Sentences (test) | m_vIoU 28.22 | 33
Spatio-Temporal Video Grounding | HCSTVG v1 (test) | m_vIoU 35.1 | 30
Spatio-Temporal Video Grounding | VidSTG Declarative Sentences | m_vIoU 33.1 | 20
Spatio-Temporal Video Grounding | HC-STVG (val) | Mean vIoU 31.2 | 19
Spatio-Temporal Video Grounding | VidSTG Declarative Sentences (test) | m_vIoU 33.14 | 17
Spatio-Temporal Video Grounding | VidSTG Declarative (test) | m_vIoU 33.1 | 14
Spatio-Temporal Video Grounding | HC-STVG v1 (test) | m_vIoU 35 | 14
Action Grounding | Daly (test) | Accuracy 55.9 | 13
Spatio-Temporal Video Grounding | HC-STVG v1 | m_vIoU 35.1 | 11
Spatio-Temporal Video Grounding | VidSTG Declarative Sentences 1.0 (test) | Mean vIoU 33.1 | 9
Showing 10 of 13 rows
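
Most results above are reported as m_vIoU (mean video IoU). As a rough illustration of how that metric is commonly defined for STVG, the sketch below sums per-frame box IoU over the frames where the predicted and ground-truth tubes overlap in time, divides by the number of frames in the union of the two temporal extents, and averages over samples. This is not taken from the STCAT repository. The tube representation (a dict from frame index to box) and the names `box_iou`, `viou`, and `mean_viou` are assumptions made for illustration. Benchmark tables usually report the value multiplied by 100.

```python
from typing import Dict, Iterable, Tuple

# Assumed representation: a box is (x1, y1, x2, y2); a tube maps frame index -> box.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred: Tube, gt: Tube) -> float:
    """Sum of per-frame IoU over shared frames, divided by the temporal union size."""
    frames_union = set(pred) | set(gt)
    if not frames_union:
        return 0.0
    frames_inter = set(pred) & set(gt)
    return sum(box_iou(pred[t], gt[t]) for t in frames_inter) / len(frames_union)


def mean_viou(pairs: Iterable[Tuple[Tube, Tube]]) -> float:
    """Average vIoU over (prediction, ground truth) pairs."""
    scores = [viou(p, g) for p, g in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical tubes: prediction covers frames 0-2, ground truth covers frames 1-3.
    pred = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10), 2: (0, 0, 10, 10)}
    gt = {1: (0, 0, 10, 10), 2: (0, 0, 10, 5), 3: (0, 0, 10, 10)}
    print(viou(pred, gt))  # (1.0 + 0.5) / 4 frames = 0.375
```

Because the denominator is the temporal union, a prediction is penalized both for poor boxes and for starting or ending at the wrong frames, which is why temporally inconsistent per-frame predictions can score low under this metric.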
