
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

About

While Video Large Language Models (Video-LLMs) have shown significant potential in multimodal understanding and reasoning tasks, how to efficiently select the most informative frames from videos remains a critical challenge. Existing methods attempt to optimize frame sampling by reducing inter-frame redundancy or employing unsupervised event localization. However, these approaches often fall short in handling complex instruction-following tasks and scenarios that demand precise temporal modeling, resulting in limited performance in both semantic alignment and temporal reasoning. To address the above challenges, we introduce Instructed Temporal Grounding for Videos (VideoITG), a framework aiming to adaptively customize frame sampling strategies based on user instructions. Specifically, we design the VidThinker pipeline, which automates annotation by generating instruction-conditioned captions, retrieving relevant video segments, and selecting key frames to enable efficient supervision. Using VidThinker, we build the VideoITG-40K dataset with 40K videos and 500K temporal grounding annotations. Our plug-and-play VideoITG model leverages Video-LLMs' visual-language alignment and reasoning for discriminative frame selection. VideoITG consistently boosts the performance on multiple multimodal video understanding benchmarks, demonstrating its effectiveness and potential.

Shihao Wang, Guo Chen, De-an Huang, Zhiqi Li, Minghan Li, Guilin Liu, Jose M. Alvarez, Lei Zhang, Zhiding Yu • 2025
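
To make the contrast in the abstract concrete, here is a minimal sketch of instruction-agnostic uniform sampling versus instruction-conditioned frame selection. It is not the VideoITG model: the function names are hypothetical, and the per-frame relevance scores are assumed to come from some upstream scorer that rates each frame against the user instruction.

```python
from typing import List, Sequence


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Baseline: pick k evenly spaced frame indices, ignoring the instruction."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    # Take the midpoint of each of the k equal-length bins.
    return [int(i * step + step / 2) for i in range(k)]


def select_frames(scores: Sequence[float], k: int) -> List[int]:
    """Pick the k highest-scoring frame indices, returned in temporal order.

    `scores` is a hypothetical per-frame relevance score for one instruction;
    how such scores are produced is outside the scope of this sketch.
    """
    if k <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the winners by index so downstream models see frames in order.
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Illustrative scores for an 8-frame clip (made-up values).
    scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.0, 0.3, 0.05]
    print(uniform_sample(len(scores), 2))
    print(select_frames(scores, 2))
```

With a fixed frame budget, the uniform baseline spreads its picks across the clip regardless of the question, while the score-based selection concentrates them where the instruction is most relevant.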

Related benchmarks

Task                                   Dataset               Result                                    Rank
Multi-choice Video Question Answering  EgoSchema (test)      Accuracy: 51.6                            26
Multi-choice Video Question Answering  VideoMME              Accuracy (no subs): 67.3                  21
Multi-choice Video Question Answering  NEXT-QA               Overall Accuracy: 79.5                    19
Open-ended Video Question Answering    ActNet-QA             Accuracy: 57.4                            18
Sports Video Understanding             DeepSport (test)      Fine-Grained Recognition Accuracy: 35.39  13
Multi-Choice Q&A                       MVBench (val)         Accuracy: 72.2                            9
Multi-Choice Q&A                       LongVideoBench (val)  Accuracy: 61.9                            7
Multi-Choice Q&A                       MLVU                  m-avg: 75                                 6
Multi-Choice Q&A                       PerceptionTest (val)  Accuracy: 64.9                            6
