Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

About

Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding; however, they struggle with fine-grained temporal grounding. In this paper, we introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner. We identify that current Video-LLMs are limited in fine-grained video understanding because they lack effective temporal modeling and timestamp representation. In light of this, we sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge to represent timestamps. To optimize the training of Grounded-VideoLLM, we employ a multi-stage training scheme, beginning with simple video-captioning tasks and progressively introducing video temporal grounding tasks of increasing complexity. To further enhance Grounded-VideoLLM's temporal reasoning capability, we also curate a grounded VideoQA dataset using an automatic annotation pipeline. Extensive experiments demonstrate that Grounded-VideoLLM not only excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, but also shows great potential as a versatile video assistant for general video understanding.
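The idea of representing timestamps as discrete temporal tokens can be illustrated with a minimal sketch. It assumes timestamps are expressed relative to the video duration and quantized into a fixed number of bins; the token format `<T{index}>`, the bin count of 300, and both function names are hypothetical choices for illustration, not details confirmed by the abstract.

```python
def timestamp_to_token(t_seconds: float, duration: float, num_bins: int = 300) -> str:
    """Map an absolute timestamp to a discrete temporal token.

    Assumption: tokens encode position relative to the video length,
    so the same vocabulary covers videos of any duration.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Clamp the relative position into [0, 1] before quantizing.
    ratio = min(max(t_seconds / duration, 0.0), 1.0)
    # The final instant of the video falls into the last bin.
    index = min(int(ratio * num_bins), num_bins - 1)
    return f"<T{index}>"  # hypothetical token format


def token_to_timestamp(token: str, duration: float, num_bins: int = 300) -> float:
    """Decode a temporal token back to seconds, using the bin centre."""
    index = int(token[2:-1])  # strip the leading "<T" and trailing ">"
    return (index + 0.5) / num_bins * duration


if __name__ == "__main__":
    tok = timestamp_to_token(30.0, duration=60.0, num_bins=100)
    print(tok, token_to_timestamp(tok, duration=60.0, num_bins=100))
```

Under this scheme the quantization error is at most half a bin width, so a larger bin count trades vocabulary size for finer temporal resolution.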

Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, Lifu Huang • 2024

Related benchmarks

Task	Dataset	Metric	Result
Video Question Answering	MSRVTT-QA	Accuracy	60.3	513
Temporal Video Grounding	Charades-STA (test)	Recall@IoU=0.5	36.4	139
Temporal Grounding	Charades-STA	mIoU	36.8	120
Grounded Video Question Answering	NExT-GQA	mIoU	21.1	69
Open-ended Video Question Answering	MSVD-QA	Accuracy	76.3	59
Dense Video Captioning	ActivityNet Captions	METEOR	6.4	48
Video Temporal Grounding	ActivityNet Captions	Recall@IoU=0.3	43.9	47
Grounded Video Question Answering	NExT-GQA (test)	Acc@GQA	26.7	45
Video Question Answering	VCG Bench	CI	3.34	42
Video Temporal Grounding	Charades-STA	R1@0.5 Recall	34.3	42

Showing 10 of 15 rows
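The grounding metrics listed above (mIoU and Recall@IoU) are built on temporal intersection-over-union between a predicted segment and a ground-truth segment. The following is a minimal sketch of the standard definitions, assuming one predicted `(start, end)` segment in seconds per query; the function names are illustrative, and the official evaluation scripts for each benchmark may differ in details.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: Sequence[Segment], gts: Sequence[Segment],
                  threshold: float = 0.5) -> float:
    """Percentage of queries whose prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds: Sequence[Segment], gts: Sequence[Segment]) -> float:
    """Average temporal IoU over all queries, as a percentage."""
    total = sum(temporal_iou(p, g) for p, g in zip(preds, gts))
    return 100.0 * total / len(gts)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (0.0, 10.0)]
    gts = [(0.0, 10.0), (5.0, 15.0)]
    print(recall_at_iou(preds, gts, threshold=0.5))
    print(mean_iou(preds, gts))
```

Recall@IoU rewards only predictions that clear the threshold, while mIoU gives partial credit for every overlap, which is why the two can rank models differently.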
