NExT-GPT: Any-to-Any Multimodal LLM

About

Although Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, most are limited to input-side multimodal understanding and cannot produce content in multiple modalities. Because humans perceive the world and communicate through various modalities, developing any-to-any MM-LLMs that can accept and deliver content in any modality is essential to human-level AI. To fill this gap, we present NExT-GPT, an end-to-end, general-purpose, any-to-any MM-LLM system. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned with only a small number of parameters (1%) in certain projection layers, which not only keeps training costs low but also makes it convenient to expand to more modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for it, which gives NExT-GPT complex cross-modal semantic understanding and content generation abilities. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community. Project page: https://next-gpt.github.io/
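The parameter-efficiency claim in the abstract (frozen encoders, LLM, and decoders, with only projection layers tuned) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Component` class, the component names, and all parameter counts are hypothetical and are not taken from the paper or its code. The sketch only shows how a trainable-parameter fraction of roughly 1% arises when a few small layers are tuned between large frozen modules.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One module in a hypothetical any-to-any pipeline."""
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(components):
    """Return the share of parameters that would receive gradient updates."""
    total = sum(c.num_params for c in components)
    trainable = sum(c.num_params for c in components if c.trainable)
    return trainable / total


# Hypothetical sizes, chosen only to illustrate the idea (not from the paper).
components = [
    Component("multimodal_encoder", 1_000_000_000, trainable=False),
    Component("input_projection", 30_000_000, trainable=True),
    Component("llm", 7_000_000_000, trainable=False),
    Component("output_projection", 50_000_000, trainable=True),
    Component("diffusion_decoders", 1_920_000_000, trainable=False),
]

frac = trainable_fraction(components)
print(f"Trainable share: {frac:.2%}")  # prints "Trainable share: 0.80%"
```

With these made-up sizes, the two projection layers account for 80 million of 10 billion parameters, so under 1% of the system is updated during tuning.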

Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua • 2023

Related benchmarks

Task                                       Dataset                 Result             Rank
Visual Question Answering                  VQA v2                  Accuracy 66.7      1165
Visual Question Answering                  GQA                     Accuracy 58        963
Visual Question Answering                  VQAv2                   Accuracy 66        177
Visual Question Answering                  VQA v2 (test)           Accuracy 66.7      131
Video Understanding                        MVBench (test)          Accuracy 27.9      97
Visual Question Answering                  VQAv2 (test)            VQA Accuracy 66.7  72
Video Question Answering                   ActivityNet (test)      Accuracy 21.5      57
Text-to-Image Generation                   MS-COCO 256x256 (val)   FID 11.28          53
Knowledge-based Visual Question Answering  OKVQA                   Accuracy 0.521     52
Text-to-Image Generation                   MSCOCO 30K              FID 11.28          42
Showing 10 of 27 rows
