VideoChat: Chat-Centric Video Understanding
About
In this paper, we present an initial attempt at an end-to-end chat-centric video understanding system, named VideoChat. It integrates video foundation models and large language models via a learnable neural interface, and excels at spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. The dataset emphasizes spatiotemporal reasoning and causal relationships, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, and it could serve as a simple prototype for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything.
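The core architectural idea above, a learnable neural interface that bridges a frozen video foundation model and a frozen large language model, can be sketched as a small set of learnable queries that cross-attend over frame features and are then projected into the LLM's token-embedding space. The sketch below is a minimal NumPy illustration of that pattern; all dimensions, the single attention head, and the random stand-in weights are assumptions for illustration, not the actual VideoChat configuration.

```python
import numpy as np

# Hypothetical dimensions -- not the actual VideoChat configuration.
VIDEO_FEAT_DIM = 768   # per-frame feature size from a video foundation model
LLM_EMBED_DIM = 4096   # token-embedding size of the language model
NUM_QUERIES = 32       # learnable queries that summarize the video
NUM_FRAMES = 8         # sampled video frames

rng = np.random.default_rng(0)

# Stand-in output of a frozen video encoder: one feature per sampled frame.
frames = rng.standard_normal((NUM_FRAMES, VIDEO_FEAT_DIM))

# The learnable interface: queries cross-attend to frame features, then a
# linear projection maps the pooled result into the LLM embedding space.
queries = rng.standard_normal((NUM_QUERIES, VIDEO_FEAT_DIM))
proj = rng.standard_normal((VIDEO_FEAT_DIM, LLM_EMBED_DIM)) * 0.02

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Single-head cross-attention: each query pools over the sampled frames.
attn = softmax(queries @ frames.T / np.sqrt(VIDEO_FEAT_DIM))

# Video tokens ready to be prepended to the LLM's text-token sequence.
video_tokens = (attn @ frames) @ proj

print(video_tokens.shape)  # (32, 4096)
```

In this scheme only the queries and projection would be trained, while the video encoder and LLM stay frozen, which is what makes the interface cheap to tune on an instruction dataset of this size.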
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy | 45 | 481 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 45 | 371 |
| Video Question Answering | MSVD-QA | Accuracy | 56.3 | 340 |
| Video Question Answering | ActivityNet-QA | Accuracy | 49.1 | 319 |
| Video Question Answering | ActivityNet-QA (test) | Accuracy | 26.5 | 275 |
| Video Question Answering | MSVD-QA (test) | Accuracy | 56.3 | 274 |
| Video Understanding | MVBench | -- | -- | 247 |
| Video Question Answering | NExT-QA (test) | Accuracy | 68.6 | 204 |
| Multimodal Understanding | SEED-Bench | -- | -- | 203 |
| Video Question Answering | EgoSchema (Full) | Accuracy | 54.4 | 193 |