TubeDETR: Spatio-Temporal Video Grounding with Transformers

About

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks. Code and trained models are publicly available at https://antoyang.github.io/tubedetr.html.

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid • 2022

Related benchmarks

Task	Dataset	Result
Spatio-Temporal Video Grounding	HCSTVG v1 (test)	m_vIoU 32.4	42
Spatio-Temporal Video Grounding	VidSTG Interrogative Sentences (test)	m_vIoU 25.7	40
Spatio-Temporal Video Grounding	HCSTVG v2 (val)	m_vIoU 36.4	38
Spatio-Temporal Video Grounding	VidSTG Declarative Sentences (test)	m_vIoU 30.4	24
Spatio-Temporal Video Grounding	VidSTG Declarative Sentences	m_vIoU 30.4	20
Spatio-Temporal Video Grounding	HC-STVG (val)	Mean vIoU 36.4	19
Spatio-Temporal Video Grounding	VidSTG Declarative (test)	m_vIoU 30.4	14
Spatio-Temporal Video Grounding	HC-STVG v1 (test)	m_vIoU 32.4	14
Action Grounding	Daly (test)	Accuracy 51.63	13
Spatio-Temporal Video Grounding	HC-STVG v1	m_vIoU 32.4	11

Showing 10 of 19 rows
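
Most rows above report m_vIoU. As a rough illustration, the sketch below assumes this refers to the mean video IoU as it is commonly defined for spatio-temporal video grounding: per-frame box IoU summed over the frames shared by the predicted and ground-truth tubes, divided by the number of frames in the union of the two temporal segments, then averaged over samples. The function names, the frame-indexed dictionary layout and the (x1, y1, x2, y2) box format are assumptions made for this example and are not taken from the TubeDETR code.

```python
from typing import Dict, Iterable, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)
Tube = Dict[int, Box]                    # assumed layout: frame index -> box


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred: Tube, gt: Tube) -> float:
    """Video IoU of one predicted tube against one ground-truth tube."""
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    # Frames outside the temporal intersection contribute zero overlap.
    total = sum(box_iou(pred[t], gt[t]) for t in shared_frames)
    return total / len(union_frames)


def mean_v_iou(samples: Iterable[Tuple[Tube, Tube]]) -> float:
    """Average vIoU over (prediction, ground truth) pairs."""
    scores = [v_iou(pred, gt) for pred, gt in samples]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    box = (0.0, 0.0, 10.0, 10.0)
    gt_tube = {t: box for t in range(0, 4)}    # frames 0..3
    pred_tube = {t: box for t in range(2, 6)}  # frames 2..5
    print(v_iou(pred_tube, gt_tube))
```

In this toy case the boxes match exactly on the two shared frames, but the tubes together span six frames, so the temporal misalignment alone caps the score at 2/6. Under the assumed definition, the metric therefore penalizes both spatial and temporal localization errors.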
