
TubeDETR: Spatio-Temporal Video Grounding with Transformers

About

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks. Code and trained models are publicly available at https://antoyang.github.io/tubedetr.html.

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid • 2022

Related benchmarks

Task                             Dataset                                Metric     Result  Rank
Spatio-Temporal Video Grounding  HCSTVG v2 (val)                        m_vIoU     36.4    38
Spatio-Temporal Video Grounding  VidSTG Interrogative Sentences (test)  m_vIoU     25.7    33
Spatio-Temporal Video Grounding  HCSTVG v1 (test)                       m_vIoU     32.4    30
Spatio-Temporal Video Grounding  VidSTG Declarative Sentences           m_vIoU     30.4    20
Spatio-Temporal Video Grounding  HC-STVG (val)                          Mean vIoU  36.4    19
Spatio-Temporal Video Grounding  VidSTG Declarative Sentences (test)    m_vIoU     30.4    17
Spatio-Temporal Video Grounding  VidSTG Declarative (test)              m_vIoU     30.4    14
Spatio-Temporal Video Grounding  HC-STVG v1 (test)                      m_vIoU     32.4    14
Action Grounding                 Daly (test)                            Accuracy   51.63   13
Spatio-Temporal Video Grounding  HC-STVG v1                             m_vIoU     32.4    11
Showing 10 of 16 rows
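
Most rows above report m_vIoU. As a rough illustration, the sketch below computes vIoU the way it is commonly defined for spatio-temporal video grounding: per-frame box IoU is summed over the frames that appear in both the predicted and the ground-truth tube, then divided by the number of frames that appear in either. m_vIoU is the mean of that value over all samples. This page does not spell out the formula, so treat the definition as an assumption. The function names, the `(x1, y1, x2, y2)` box format, and the dict-of-frames tube representation are hypothetical choices made for this example and are not taken from the TubeDETR code. The benchmark numbers are presumably percentages, while the functions return fractions in [0, 1].

```python
from typing import Dict, Sequence, Tuple

# Hypothetical representation: a box is (x1, y1, x2, y2) and a tube maps
# frame index -> box for every frame the tube covers.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred: Tube, gt: Tube) -> float:
    """Sum of per-frame IoU over shared frames, divided by frames in either tube."""
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in shared_frames)
    return total / len(union_frames)


def mean_viou(pairs: Sequence[Tuple[Tube, Tube]]) -> float:
    """Average vIoU over (prediction, ground truth) pairs."""
    if not pairs:
        return 0.0
    return sum(viou(p, g) for p, g in pairs) / len(pairs)


# Toy example: the tubes overlap on frames 1 and 2 and span frames 0..3.
pred_tube = {0: (0, 0, 2, 2), 1: (0, 0, 2, 2), 2: (0, 0, 2, 2)}
gt_tube = {1: (0, 0, 2, 2), 2: (1, 0, 3, 2), 3: (1, 0, 3, 2)}
print(round(viou(pred_tube, gt_tube), 4))  # 0.3333
```

In the toy example the frame-1 boxes match exactly (IoU 1) and the frame-2 boxes overlap by half their width (IoU 1/3), so vIoU is (1 + 1/3) / 4 = 1/3. A prediction is penalized both for loose boxes and for covering the wrong time span.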
