GPT4Video: A Unified Multimodal Large Language Model for lnstruction-Followed Understanding and Safety-Aware Generation
About
While the recent advances in Multimodal Large Language Models (MLLMs) constitute a significant leap forward in the field, these models are predominantly confined to the realm of input-side multimodal comprehension, lacking the capacity for multimodal content generation. To fill this gap, we present GPT4Video, a unified multi-model framework that empowers Large Language Models (LLMs) with the capability of both video understanding and generation. Specifically, we develop an instruction-following-based approach integrated with the stable diffusion generative model, which has demonstrated to effectively and securely handle video generation scenarios. GPT4Video offers the following benefits: 1) It exhibits impressive capabilities in both video understanding and generation scenarios. For example, GPT4Video outperforms Valley by 11.8\% on the Video Question Answering task, and surpasses NExt-GPT by 2.3\% on the Text to Video generation task. 2) it endows the LLM/MLLM with video generation capabilities without requiring additional training parameters and can flexibly interface with a wide range of models to perform video generation. 3) it maintains a safe and healthy conversation not only in output-side but also the input side in an end-to-end manner. Qualitative and qualitative experiments demonstrate that GPT4Video holds the potential to function as a effective, safe and Humanoid-like video assistant that can handle both video understanding and generation scenarios.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA | Accuracy48.7 | 319 | |
| Video Captioning | MSVD | CIDEr48 | 128 | |
| Video Question Answering | MSVD | Accuracy66.3 | 100 | |
| Video Captioning | MSRVTT | CIDEr42.3 | 61 | |
| Video Question Answering | MSRVTT | Accuracy49.8 | 46 | |
| Video Question Answering | COIN | Accuracy91.1 | 13 | |
| Video Question Answering | Breakfast | Accuracy90.9 | 13 | |
| Video Question Answering | LVU | Accuracy61.6 | 13 |