
VideoChat: Chat-Centric Video Understanding

About

In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, coined VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications; it could serve as a simple prototype for future research on chat-centric video understanding. Code and data are available at https://github.com/OpenGVLab/Ask-Anything

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, Yu Qiao • 2023

Related benchmarks

Task                      Dataset                Result         Rank
Video Question Answering  MSRVTT-QA              Accuracy 45    481
Video Question Answering  MSRVTT-QA (test)       Accuracy 45    371
Video Question Answering  MSVD-QA                Accuracy 56.3  340
Video Question Answering  ActivityNet-QA         Accuracy 49.1  319
Video Question Answering  ActivityNet-QA (test)  Accuracy 26.5  275
Video Question Answering  MSVD-QA (test)         Accuracy 56.3  274
Video Understanding       MVBench                --             247
Video Question Answering  NExT-QA (test)         Accuracy 68.6  204
Multimodal Understanding  SEED-Bench             --             203
Video Question Answering  EgoSchema (Full)       Accuracy 54.4  193
Showing 10 of 107 rows