
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

About

Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these capabilities to the video modality; the resulting models are termed Video-LLMs. However, existing Video-LLMs capture only coarse-grained semantics and cannot effectively handle tasks that require comprehending or localizing specific video segments. In light of these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained temporally grounded comprehension and localization.

Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, Siliang Tang • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Video Question Answering | MSRVTT-QA | Accuracy | 55.6 | 481
Highlight Detection | QVHighlights (test) | -- | -- | 151
Temporal Video Grounding | Charades-STA (test) | Recall@IoU=0.5 | 26.6 | 117
Video Grounding | Charades-STA | R@1 IoU=0.5 | 26.6 | 113
Natural Language Video Localization | Charades-STA (test) | R@1 (IoU=0.5) | 26.6 | 61
Open-ended Video Question Answering | MSVD-QA | Accuracy | 68.9 | 59
Dense Video Captioning | ActivityNet Captions | METEOR | 4.7 | 43
Temporal Video Grounding | Charades-STA | Rank-1 Recall (IoU=0.5) | 26.6 | 33
Temporal Grounding | Charades-STA | mIoU | 28.5 | 33
Video highlight detection | QVHighlights | mAP | 0.076 | 29

(Showing 10 of 17 rows)
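Several rows above report temporal-grounding metrics such as R@1 (IoU=0.5) and mIoU. As a minimal sketch of how these metrics are conventionally computed (the function names and example segments below are illustrative, not taken from Momentor's code): a predicted segment counts as a hit when its temporal IoU with the ground-truth segment meets the threshold.

```python
def temporal_iou(pred, gt):
    """Intersection-over-Union of two [start, end] time segments (seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, threshold=0.5):
    """R@1 (IoU=threshold): fraction of queries whose top-1 predicted
    segment overlaps the ground truth with IoU >= threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# Two example queries: the first prediction overlaps well, the second misses.
preds = [(2.0, 7.5), (10.0, 12.0)]
gts = [(3.0, 8.0), (30.0, 35.0)]
print(recall_at_iou(preds, gts))  # 0.5 (first pair IoU = 4.5/6.0 = 0.75, second 0.0)
```

mIoU is the same pairwise `temporal_iou` averaged over all queries instead of thresholded.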
