
1 + 1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning

About

Spatio-temporal grounding and reasoning aims to locate the temporal segment and spatial region of an event in a video given a user query, while also reasoning about semantics such as causality, temporal order, and action relationships. To achieve this, current MLLMs primarily treat bounding boxes as text tokens and generate them autoregressively. However, such autoregressive spatial decoding leads to very long output sequences, causing spatial errors to accumulate over time and the localization results to progressively drift across a video. To address this, we present a Detector-Empowered Video LLM, DEViL for short, which couples a Video LLM with an open-vocabulary detector (OVD). Specifically, the MLLM and detector are connected via a reference-semantic token (RST) that distills the user query into a rich semantic representation. Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD's text embedding, enabling end-to-end learning of both referential understanding and spatial localization. Furthermore, we propose a tube-mined temporal regularization (TTReg) within the OVD, which drives the OVD to generate temporally consistent queries for target objects, thereby ensuring effective temporal association. Experiments demonstrate that DEViL achieves strong performance across various fine-grained video understanding tasks, particularly STVG and GroundedVQA. Code will be released at https://github.com/gaostar123/DeViL.

Shida Gao, Feng Xue, Xiangfeng Wang, Anlong Ming, Teng Long, Yihua Shao, Haozhe Wang, Zhaowen Lin, Wei Wang, Nicu Sebe • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Understanding | MVBench | Accuracy: 65.4 | 247 |
| Temporal Video Grounding | Charades-STA (test) | Recall@IoU=0.5: 51.5 | 117 |
| Temporal Video Understanding | TempCompass | -- | 52 |
| Grounded Video Question Answering | NExT-GQA (test) | mIoU: 27.9 | 24 |
| Spatio-Temporal Video Grounding | VidSTG Declarative Sentences | m_vIoU: 32 | 20 |
| Spatio-Temporal Reasoning | V-Star | Accuracy: 34.3 | 14 |
| Spatio-Temporal Video Grounding | HC-STVG v1 | m_vIoU: 36.2 | 11 |
| Spatio-Temporal Video Grounding | HC-STVG v2 | m_tIoU: 58 | 9 |
| Spatio-Temporal Video Grounding | VidSTG Interrogative Sentence | m_vIoU: 27.7 | 8 |
| Video Understanding | VideoMME | Accuracy (No Subtitles): 0.577 | 7 |
Showing 10 of 12 rows
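
For readers unfamiliar with the grounding metrics named in the table, the following is a minimal sketch of how temporal IoU, video IoU (vIoU), and Recall@IoU are commonly computed in the STVG literature. It is not taken from the DEViL code; the function names, the tube representation (a dict from frame index to box), and the ">=" threshold convention are assumptions made for illustration. Averaging temporal_iou or video_iou over all test samples would give m_tIoU or m_vIoU respectively.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def video_iou(pred_tube, gt_tube):
    """vIoU: box IoU summed over frames present in both tubes,
    divided by the number of frames present in either tube.
    Tubes are dicts mapping frame index -> box (assumed representation)."""
    shared = pred_tube.keys() & gt_tube.keys()
    either = pred_tube.keys() | gt_tube.keys()
    if not either:
        return 0.0
    return sum(box_iou(pred_tube[f], gt_tube[f]) for f in shared) / len(either)


def recall_at_iou(ious, threshold=0.5):
    """Fraction of samples whose IoU reaches the threshold
    (">=" is an assumed convention; some benchmarks use ">")."""
    return sum(i >= threshold for i in ious) / len(ious) if ious else 0.0
```

Under these definitions, a prediction with perfect boxes but poor temporal boundaries still scores a low vIoU, because the denominator counts every frame in either tube.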
