TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos

About

We introduce TemporalVLM, a video large language model (video LLM) for temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder that maps a long video into time-aware features containing both local and global cues. It first divides an input video into short-term clips, which are jointly encoded with their timestamps and fused across overlapping temporal windows into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. Moreover, to facilitate the evaluation of TemporalVLM, we present IndustryASM, a large-scale long video dataset of industrial assembly processes. It consists of videos recorded on factory floors, with actions and timestamps annotated by industrial engineers for time and motion studies and for temporal action segmentation evaluation. Finally, extensive experiments show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, namely dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation. To the best of our knowledge, our work is the first to incorporate LSTMs into video LLMs.
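The clip-splitting and overlapping-window steps described above can be illustrated with a small sketch. It is not the authors' implementation: the names `Clip`, `split_into_clips`, and `overlapping_windows`, as well as the clip length, window size, and stride, are hypothetical choices made for illustration, and the feature encoding, fusion, and BiLSTM stages are left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    """A short-term clip with its start/end timestamps in seconds (hypothetical type)."""
    start_s: float
    end_s: float
    frame_indices: List[int]


def split_into_clips(num_frames: int, fps: float, clip_len: int) -> List[Clip]:
    """Split a video into consecutive, non-overlapping clips of `clip_len` frames.

    The last clip may be shorter if `num_frames` is not a multiple of `clip_len`.
    """
    clips = []
    for start in range(0, num_frames, clip_len):
        end = min(start + clip_len, num_frames)
        clips.append(Clip(start / fps, end / fps, list(range(start, end))))
    return clips


def overlapping_windows(clips: List[Clip], window: int, stride: int) -> List[List[Clip]]:
    """Group clips into windows of `window` clips that advance by `stride` clips.

    With stride < window, neighbouring windows share clips, which is the
    overlap that a fusion step would operate on (assumed parameters).
    """
    if not 0 < stride <= window:
        raise ValueError("stride must satisfy 0 < stride <= window")
    if not clips:
        return []
    windows = []
    start = 0
    while True:
        windows.append(clips[start:start + window])
        if start + window >= len(clips):
            break  # the current window already reaches the final clip
        start += stride
    return windows


# Example: a 10-second video at 10 fps, 16-frame clips, windows of 4 clips with stride 2.
clips = split_into_clips(num_frames=100, fps=10.0, clip_len=16)
windows = overlapping_windows(clips, window=4, stride=2)
```

In this example, consecutive windows share two clips, so each clip away from the video boundaries is seen in two windows. In the described model, the timestamps carried by each clip would be encoded jointly with the visual content, a step this sketch does not attempt.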

Fawad Javed Fateh, Umer Ahmed, Hamza Khan, M. Zeeshan Zia, Quoc-Huy Tran • 2024

Related benchmarks

Task                             Dataset               Metric          Result  Rank
Highlight Detection              QVHighlights (test)   HIT@1           31.3    167
Temporal Grounding               Charades-STA          -               -       107
Dense Video Captioning           YouCook2              SODA_c          3.4     40
Video highlight detection        QVHighlights          mAP             0.251   32
Video-based Dialogue Evaluation  Video-ChatGPT         CI              2.88    24
Temporal Video Grounding         Charades-STA          R@1 (IoU=0.5)   54.4    6
Temporal Video Grounding         Charades-STA (test)   R@1 (IoU=0.5)   0.301   6
Dense Video Captioning           YouCook2 (test)       SODA_c          1.2     6
Temporal action segmentation     IndustryASM           F1@10%          22.3    3
Dense Captioning                 YouCook2 zero-shot    SODA_c          3.4     3
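For readers unfamiliar with the R@1 (IoU=0.5) metric listed for temporal video grounding, the sketch below shows how it is commonly computed: a query counts as a hit when the top-1 predicted segment overlaps the ground-truth segment with temporal IoU of at least 0.5. This is the usual definition, not something stated on this page, and the function names `temporal_iou` and `recall_at_1` are hypothetical. The function returns a fraction in [0, 1]; leaderboards report it either as a fraction or as a percentage.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: Sequence[Segment], gts: Sequence[Segment],
                iou_threshold: float = 0.5) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Two queries: the first prediction overlaps its ground truth with IoU 1/3 (a miss),
# the second matches exactly (a hit).
score = recall_at_1([(0.0, 10.0), (2.0, 8.0)], [(5.0, 15.0), (2.0, 8.0)])
```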