NExT-GPT: Any-to-Any Multimodal LLM
About
Although Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, most are limited to input-side multimodal understanding and cannot produce content in multiple modalities. Because humans perceive the world and communicate with one another through various modalities, developing any-to-any MM-LLMs that can accept and deliver content in any modality is essential to human-level AI. To fill this gap, we present NExT-GPT, an end-to-end, general-purpose any-to-any MM-LLM system. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned by updating only a small fraction of its parameters (1%), located in certain projection layers, which not only keeps training low-cost but also facilitates convenient expansion to more modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, with which NExT-GPT gains complex cross-modal semantic understanding and content generation capabilities. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community. Project page: https://next-gpt.github.io/
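The idea of tuning only the projection layers while keeping the encoders, the LLM, and the diffusion decoders frozen can be illustrated with a small bookkeeping sketch. This is not the authors' implementation: the component names and every parameter count below are hypothetical placeholders chosen for illustration, not figures from the paper. It only shows how the trainable fraction of such a system would be computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    """One block of a hypothetical any-to-any pipeline."""
    name: str
    num_params: int
    trainable: bool  # False means the block stays frozen during tuning


def trainable_fraction(components: List[Component]) -> float:
    """Return the share of parameters that would be updated during tuning."""
    total = sum(c.num_params for c in components)
    trainable = sum(c.num_params for c in components if c.trainable)
    return trainable / total


# ASSUMPTION: all names and counts are illustrative, not taken from the paper.
components = [
    Component("multimodal_encoder", 1_000_000_000, trainable=False),
    Component("input_projection", 30_000_000, trainable=True),
    Component("llm", 7_000_000_000, trainable=False),
    Component("output_projection", 50_000_000, trainable=True),
    Component("diffusion_decoders", 3_000_000_000, trainable=False),
]

fraction = trainable_fraction(components)
print(f"Trainable share of parameters: {fraction:.2%}")
```

With these placeholder counts the trainable share stays below one percent, which is the regime the abstract describes; real values depend on the actual encoders, LLM, and decoders used.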
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 66.7 | 1165 |
| Visual Question Answering | GQA | Accuracy | 58 | 963 |
| Visual Question Answering | VQAv2 | Accuracy | 66 | 177 |
| Visual Question Answering | VQA v2 (test) | Accuracy | 66.7 | 131 |
| Video Understanding | MVBench (test) | Accuracy | 27.9 | 97 |
| Visual Question Answering | VQAv2 (test) | VQA Accuracy | 66.7 | 72 |
| Video Question Answering | ActivityNet (test) | Accuracy | 21.5 | 57 |
| Text-to-Image Generation | MS-COCO 256x256 (val) | FID | 11.28 | 53 |
| Knowledge-based Visual Question Answering | OKVQA | Accuracy | 0.521 | 52 |
| Text-to-Image Generation | MSCOCO 30K | FID | 11.28 | 42 |