VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

About

While Video Large Language Models (Video-LLMs) have shown significant potential in multimodal understanding and reasoning tasks, efficiently selecting the most informative frames from a video remains a critical challenge. Existing methods attempt to optimize frame sampling by reducing inter-frame redundancy or employing unsupervised event localization. However, these approaches often fall short on complex instruction-following tasks and in scenarios that demand precise temporal modeling, which limits performance in both semantic alignment and temporal reasoning. To address these challenges, we introduce Instructed Temporal Grounding for Videos (VideoITG), a framework that adapts the frame sampling strategy to the user's instruction. Specifically, we design the VidThinker pipeline, which automates annotation by generating instruction-conditioned captions, retrieving relevant video segments, and selecting key frames to enable efficient supervision. Using VidThinker, we build the VideoITG-40K dataset with 40K videos and 500K temporal grounding annotations. Our plug-and-play VideoITG model leverages Video-LLMs' visual-language alignment and reasoning for discriminative frame selection. VideoITG consistently improves performance on multiple multimodal video understanding benchmarks, demonstrating its effectiveness and potential.
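To make the contrast between instruction-agnostic and instruction-conditioned frame sampling concrete, here is a minimal Python sketch. It is not the VideoITG model or the VidThinker pipeline: the function names `uniform_sample` and `select_frames` are hypothetical, and the per-frame relevance scores are assumed to come from some upstream scorer conditioned on the user's instruction, which this sketch does not implement.

```python
from typing import List, Sequence


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Instruction-agnostic baseline: pick k evenly spaced frame indices."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    # Integer arithmetic picks the midpoint of each of the k equal-width bins.
    return [((2 * i + 1) * num_frames) // (2 * k) for i in range(k)]


def select_frames(relevance: Sequence[float], k: int) -> List[int]:
    """Instruction-conditioned selection (illustrative only).

    `relevance[i]` is an ASSUMED score for how relevant frame i is to the
    instruction. Keeps the k highest-scoring frames and returns their
    indices in temporal order, so a downstream model sees them in sequence.
    """
    if k <= 0:
        return []
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Made-up scores for an 8-frame clip; frames 4-6 are most relevant.
    scores = [0.05, 0.10, 0.08, 0.20, 0.90, 0.85, 0.70, 0.15]
    print("uniform:   ", uniform_sample(len(scores), 3))
    print("instructed:", select_frames(scores, 3))
```

With the same frame budget, the uniform baseline spreads its picks across the whole clip, while the score-based selection concentrates them on the span the scores mark as relevant. Top-k selection is only one simple policy; the abstract does not specify how the actual model turns its predictions into a frame set.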

Shihao Wang, Guo Chen, De-an Huang, Zhiqi Li, Minghan Li, Guilin Liu, Jose M. Alvarez, Lei Zhang, Zhiding Yu • 2025

Related benchmarks

Task	Dataset	Metric	Result
Multi-choice Video Question Answering	NEXT-QA	Overall Accuracy	79.5	35
Multi-choice Video Question Answering	EgoSchema (test)	Accuracy	51.6	26
Multi-choice Video Question Answering	VideoMME	Accuracy (no subs)	67.3	21
Open-ended Video Question Answering	ActNet-QA	Accuracy	57.4	18
Sports Video Understanding	DeepSport (test)	Fine-Grained Recognition Accuracy	35.39	13
Multi-Choice Q&A	MVBench (val)	Accuracy	72.2	9
Multi-Choice Q&A	LongVideoBench (val)	Accuracy	61.9	7
Multi-Choice Q&A	MLVU	m-avg	75	6
Multi-Choice Q&A	PerceptionTest (val)	Accuracy	64.9	6
