VideoChat: Chat-Centric Video Understanding
About
In this paper, we present an initial attempt at an end-to-end chat-centric video understanding system, named VideoChat. It integrates video foundation models and large language models via a learnable neural interface, and excels at spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. The dataset emphasizes spatiotemporal reasoning and causal relationships, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, and it could serve as a simple prototype for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything.
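The core architectural idea above, a learnable neural interface that bridges a frozen video foundation model and a frozen large language model, can be sketched as a small set of learnable queries that cross-attend over frame features and are then projected into the LLM's token-embedding space. The sketch below is a minimal NumPy illustration of that pattern; all dimensions, the single attention head, and the random stand-in weights are assumptions for illustration, not the actual VideoChat configuration.

```python
import numpy as np

# Hypothetical dimensions -- not the actual VideoChat configuration.
VIDEO_FEAT_DIM = 768   # per-frame feature size from a video foundation model
LLM_EMBED_DIM = 4096   # token-embedding size of the language model
NUM_QUERIES = 32       # learnable queries that summarize the video
NUM_FRAMES = 8         # sampled video frames

rng = np.random.default_rng(0)

# Stand-in output of a frozen video encoder: one feature per sampled frame.
frames = rng.standard_normal((NUM_FRAMES, VIDEO_FEAT_DIM))

# The learnable interface: queries cross-attend to frame features, then a
# linear projection maps the pooled result into the LLM embedding space.
queries = rng.standard_normal((NUM_QUERIES, VIDEO_FEAT_DIM))
proj = rng.standard_normal((VIDEO_FEAT_DIM, LLM_EMBED_DIM)) * 0.02

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Single-head cross-attention: each query pools over the sampled frames.
attn = softmax(queries @ frames.T / np.sqrt(VIDEO_FEAT_DIM))

# Video tokens ready to be prepended to the LLM's text-token sequence.
video_tokens = (attn @ frames) @ proj

print(video_tokens.shape)  # (32, 4096)
```

In this scheme only the queries and projection would be trained, while the video encoder and LLM stay frozen, which is what makes the interface cheap to tune on an instruction dataset of this size.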
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy | 45 | 481 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 45 | 371 |
| Video Question Answering | MSVD-QA | Accuracy | 56.3 | 340 |
| Video Question Answering | ActivityNet-QA | Accuracy | 49.1 | 319 |
| Video Question Answering | ActivityNet-QA (test) | Accuracy | 26.5 | 275 |
| Video Question Answering | MSVD-QA (test) | Accuracy | 56.3 | 274 |
| Video Understanding | MVBench | -- | -- | 247 |
| Video Question Answering | NExT-QA (test) | Accuracy | 68.6 | 204 |
| Multimodal Understanding | SEED-Bench | -- | -- | 203 |
| Video Question Answering | EgoSchema (Full) | Accuracy | 54.4 | 193 |