
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

About

This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. The model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds the visual content of each frame to its timestamp, and (2) a sliding video Q-Former that produces video token sequences of varying lengths to accommodate videos of different durations. Additionally, the authors construct an instruction-tuning dataset encompassing 6 tasks and 125K instances in total to further enhance TimeChat's instruction-following performance. Experimental results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example, it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA compared to state-of-the-art video large language models, showing its potential to serve as a versatile video assistant for long-form video comprehension that satisfies realistic user requirements.
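The two architectural ideas above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names are hypothetical, timestamp binding is reduced to appending the timestamp as an extra feature channel, and the Q-Former's learned cross-attention is replaced by mean pooling. What the sketch does preserve is the key property that the number of output video tokens grows with video duration while each window is compressed to a fixed token budget.

```python
import numpy as np

def timestamp_aware_encode(frame_feats, timestamps):
    # Hypothetical stand-in for the timestamp-aware frame encoder:
    # bind each frame's visual feature to its timestamp by appending
    # the timestamp as one extra channel.
    t = np.asarray(timestamps, dtype=np.float32).reshape(-1, 1)
    return np.concatenate([np.asarray(frame_feats, dtype=np.float32), t], axis=1)

def sliding_video_qformer(frame_feats, window=32, stride=32, tokens_per_window=4):
    # Hypothetical stand-in for the sliding video Q-Former: slide a
    # window over the frame sequence and compress each window to a
    # fixed number of tokens (mean pooling here instead of learned
    # query cross-attention). Longer videos yield more windows and
    # therefore longer token sequences.
    tokens = []
    for start in range(0, len(frame_feats), stride):
        win = frame_feats[start:start + window]
        # Split the window into tokens_per_window chunks; pool each chunk.
        for chunk in np.array_split(win, tokens_per_window):
            if len(chunk) > 0:
                tokens.append(chunk.mean(axis=0))
    return np.stack(tokens)

# A 96-frame video at 2 fps: 3 windows x 4 tokens = 12 video tokens.
feats = timestamp_aware_encode(np.random.rand(96, 8), np.arange(96) / 2.0)
video_tokens = sliding_video_qformer(feats)
```

With these toy settings, a 96-frame clip produces 12 tokens and a 160-frame clip produces 20, so the token sequence length tracks video duration rather than being fixed.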

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video Understanding | MVBench | Accuracy | 38.5 | 247 |
| Video Question Answering | NExT-QA (test) | Accuracy | 43.59 | 204 |
| Video Question Answering | EgoSchema (Full) | Accuracy | 33 | 193 |
| Video Understanding | VideoMME | Overall Score | 34.7 | 192 |
| Moment Retrieval | Charades-STA (test) | R@0.5 | 32.2 | 172 |
| Highlight Detection | QVHighlights (test) | HIT@1 | 37.9 | 151 |
| Temporal Video Grounding | Charades-STA (test) | Recall@IoU=0.5 | 46.7 | 117 |
| Video Grounding | Charades-STA | R@1 IoU=0.5 | 46.7 | 113 |
| Long Video Understanding | LongVideoBench | Score | 38.5 | 110 |
| Video Question Answering | NEXT-QA | Overall Accuracy | 71.05 | 105 |

Showing 10 of 71 rows.

Other info

Code
