MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

About

Integrating video foundation models with large language models makes it possible to build video understanding systems that overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connections pose additional challenges. Drawing on the Atkinson-Shiffrin memory model, we propose MovieChat, which uses Transformer tokens as the carriers of memory in combination with a specially designed memory mechanism to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos and 14K manual annotations, to validate the effectiveness of our method.
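The abstract does not spell out the memory mechanism, so the following is only a minimal sketch of one way a dense stream of frame tokens could be condensed into a sparse memory. It assumes a fixed-size short-term buffer that, once full, is consolidated into long-term memory by repeatedly averaging the most similar pair of adjacent frames. The names `SparseMemory`, `consolidate`, `short_capacity`, and `merged_len` are hypothetical, and each frame is reduced to a single feature vector for brevity.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(frames: List[Vector], target_len: int) -> List[Vector]:
    """Merge the most similar adjacent pair until target_len frames remain.

    Assumption: merging is a plain average of the two frame vectors.
    """
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    frames = [list(f) for f in frames]  # work on a copy
    while len(frames) > target_len:
        sims = [cosine(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)  # most similar neighbours
        merged = [(x + y) / 2 for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
    return frames


class SparseMemory:
    """Hypothetical two-stage memory: dense short-term buffer, sparse long-term store."""

    def __init__(self, short_capacity: int, merged_len: int) -> None:
        self.short_capacity = short_capacity  # frames held before consolidation
        self.merged_len = merged_len          # frames kept per consolidation
        self.short_term: List[Vector] = []
        self.long_term: List[Vector] = []

    def add_frame(self, feature: Vector) -> None:
        """Append one frame feature; consolidate when the buffer is full."""
        self.short_term.append(list(feature))
        if len(self.short_term) >= self.short_capacity:
            self.long_term.extend(consolidate(self.short_term, self.merged_len))
            self.short_term.clear()


memory = SparseMemory(short_capacity=4, merged_len=2)
for frame in ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]):
    memory.add_frame(frame)
print(len(memory.long_term), len(memory.short_term))
```

In this toy run, four frames that form two visually similar pairs are reduced to two long-term entries, so memory grows with the number of distinct scenes rather than the number of frames.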

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang • 2023

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy 55.1	635
Video Question Answering	MSRVTT-QA	Accuracy 52.7	513
Video Question Answering	ActivityNet-QA	Accuracy 51.5	438
Video Question Answering	MSVD-QA	Accuracy 75.2	401
Video Question Answering	MSRVTT-QA (test)	Accuracy 52.7	376
Video Understanding	VideoMME	Score (Overall) 38.2	369
Long Video Understanding	LongVideoBench	Score 55.1	290
Video Question Answering	ActivityNet-QA (test)	Accuracy 45.7	288
Video Question Answering	MSVD-QA (test)	Accuracy 75.2	279
Long Video Understanding	LVBench	Accuracy 22.5	267

Showing 10 of 112 rows
