Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

About

Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts aim to transfer these abilities to the video modality; the resulting models are termed Video-LLMs. However, existing Video-LLMs capture only coarse-grained semantics and cannot effectively handle tasks that require comprehending or localizing specific video segments. To address these limitations, we propose Momentor, a Video-LLM capable of fine-grained temporal understanding. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained, temporally grounded comprehension and localization.

Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang • 2024

Related benchmarks

Task	Dataset	Metric	Result
Video Question Answering	MSRVTT-QA	Accuracy	55.6	505
Video Question Answering	MSVD-QA	Accuracy	68.9	393
Moment Retrieval	Charades-STA (test)	R@0.5	23	186
Highlight Detection	QVHighlights (test)	--	--	167
Temporal Video Grounding	Charades-STA (test)	Recall@IoU=0.5	26.6	124
Video Grounding	Charades-STA	R@1 IoU=0.5	26.6	113
Temporal Grounding	Charades-STA	mIoU	28.5	107
Temporal Grounding	ActivityNet Captions	Recall@1 (IoU=0.5)	23	85
Natural Language Video Localization	Charades-STA (test)	R@1 (IoU=0.5)	26.6	61
Open-ended Video Question Answering	MSVD-QA	Accuracy	68.9	59

Showing 10 of 29 rows
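
The grounding metrics in the table (R@1 at IoU=0.5, mIoU) are conventionally computed from the temporal intersection-over-union between a predicted segment and a ground-truth segment. The following is a minimal sketch of those standard definitions, not the evaluation code used for Momentor; the function names and the (start, end) tuple representation in seconds are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed representation: a segment is (start_sec, end_sec).
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Segment], gts: List[Segment],
                threshold: float = 0.5) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """Average IoU over all queries, expressed as a percentage."""
    total = sum(temporal_iou(p, g) for p, g in zip(preds, gts))
    return 100.0 * total / len(gts)


if __name__ == "__main__":
    # Hypothetical predictions and ground truths, one per query.
    preds = [(0.0, 10.0), (0.0, 10.0)]
    gts = [(0.0, 10.0), (5.0, 15.0)]
    print(recall_at_1(preds, gts), mean_iou(preds, gts))
```

Under this reading, a Charades-STA result of 26.6 for R@1 (IoU=0.5) would mean that for 26.6% of queries the top predicted segment overlaps the annotated segment with IoU of at least 0.5.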
