Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
About
Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts are being made to transfer these capabilities to the video modality; the resulting models are termed Video-LLMs. However, existing Video-LLMs capture only coarse-grained semantics and cannot effectively handle tasks that involve comprehending or localizing specific video segments. To address these challenges, we propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine that constructs Moment-10M, a large-scale video instruction dataset with segment-level instruction data. We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained temporally grounded comprehension and localization.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy | 55.6 | 491 |
| Video Question Answering | MSVD-QA | Accuracy | 68.9 | 360 |
| Moment Retrieval | Charades-STA (test) | R@0.5 | 23 | 186 |
| Highlight Detection | QVHighlights (test) | -- | -- | 161 |
| Temporal Video Grounding | Charades-STA (test) | Recall@IoU=0.5 | 26.6 | 124 |
| Video Grounding | Charades-STA | R@1 IoU=0.5 | 26.6 | 113 |
| Temporal Grounding | Charades-STA | R@0.5 | 26.6 | 88 |
| Temporal Grounding | ActivityNet Captions | Recall@1 (IoU=0.5) | 23 | 75 |
| Natural Language Video Localization | Charades-STA (test) | R@1 (IoU=0.5) | 26.6 | 61 |
| Open-ended Video Question Answering | MSVD-QA | Accuracy | 68.9 | 59 |
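Several rows above report variants of the same metric: Recall@1 at a temporal IoU threshold of 0.5, i.e. the fraction of queries whose top-1 predicted segment overlaps the ground-truth segment with IoU ≥ 0.5. The sketch below illustrates how this standard metric is computed; the function names and sample intervals are illustrative, not taken from the Momentor codebase.

```python
def temporal_iou(pred, gt):
    """IoU between two time intervals, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thresh=0.5):
    """Fraction of queries whose top-1 prediction reaches IoU >= thresh."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(preds)

# Two hypothetical queries: the first overlaps well, the second misses entirely.
preds = [(2.0, 8.0), (10.0, 12.0)]
gts = [(3.0, 9.0), (20.0, 25.0)]
print(recall_at_iou(preds, gts))  # 0.5
```

Under this definition, "R@0.5 = 26.6" on Charades-STA means 26.6% of test queries are localized with at least 50% temporal overlap.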