TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
About
This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. The model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds the visual content of each frame to its timestamp, and (2) a sliding video Q-Former that produces a video token sequence of varying length to accommodate videos of different durations. Additionally, we construct an instruction-tuning dataset encompassing 6 tasks and a total of 125K instances to further enhance TimeChat's instruction-following performance. Experimental results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example, compared to state-of-the-art video large language models, it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA. TimeChat thus holds the potential to serve as a versatile video assistant for long-form video comprehension tasks and to satisfy realistic user requirements.
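The two architectural ideas above can be illustrated with a minimal sketch. This is not TimeChat's actual implementation: `timestamp_aware_frames` stands in for the timestamp-aware frame encoder by simply appending each frame's timestamp to its feature vector, and `sliding_video_qformer` replaces the learned query cross-attention of the real Q-Former with mean pooling. All function names, window sizes, and token counts are illustrative assumptions; the point is only that longer videos yield more windows and hence a longer video token sequence.

```python
def timestamp_aware_frames(features, timestamps):
    # Hypothetical stand-in for the timestamp-aware frame encoder:
    # append each frame's timestamp as an extra feature dimension,
    # binding visual content to the time at which it occurs.
    return [feat + [float(t)] for feat, t in zip(features, timestamps)]

def mean_pool(rows):
    # Element-wise average of a list of equal-length feature vectors.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sliding_video_qformer(frames, window=32, stride=32, tokens_per_window=4):
    # Stand-in for the sliding video Q-Former: slide a fixed-size window
    # over the frame sequence and compress each window into a fixed number
    # of "video tokens" (mean pooling here, instead of learned query
    # cross-attention). The number of windows, and therefore the output
    # length, grows with video duration.
    tokens = []
    for start in range(0, max(len(frames) - window, 0) + 1, stride):
        chunk = frames[start:start + window]
        group_size = max(len(chunk) // tokens_per_window, 1)
        for g in range(0, len(chunk), group_size):
            tokens.append(mean_pool(chunk[g:g + group_size]))
    return tokens

# A 96-frame video produces 3 windows of 4 tokens each (12 tokens),
# while a 64-frame clip of the same video produces only 8 tokens.
frames = [[float(i)] * 8 for i in range(96)]
stamped = timestamp_aware_frames(frames, [i / 2.0 for i in range(96)])
long_tokens = sliding_video_qformer(stamped)
short_tokens = sliding_video_qformer(stamped[:64])
```

Keeping the per-window token count fixed while letting the window count vary is what lets the LLM receive a token budget proportional to video length rather than a single fixed-size summary.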
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | Accuracy | 38.5 | 247 |
| Video Question Answering | NExT-QA (test) | Accuracy | 43.59 | 204 |
| Video Question Answering | EgoSchema (Full) | Accuracy | 33 | 193 |
| Video Understanding | VideoMME | Overall Score | 34.7 | 192 |
| Moment Retrieval | Charades-STA (test) | R@0.5 | 32.2 | 172 |
| Highlight Detection | QVHighlights (test) | HIT@1 | 37.9 | 151 |
| Temporal Video Grounding | Charades-STA (test) | Recall@IoU=0.5 | 46.7 | 117 |
| Video Grounding | Charades-STA | R@1 IoU=0.5 | 46.7 | 113 |
| Long Video Understanding | LongVideoBench | Score | 38.5 | 110 |
| Video Question Answering | NExT-QA | Overall Accuracy | 71.05 | 105 |