
PandaGPT: One Model To Instruction-Follow Them All

About

We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as detailed image description generation, writing stories inspired by videos, and answering questions about audios. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image/video and how they sound in an audio. To do so, PandaGPT combines the multimodal encoders from ImageBind and the large language models from Vicuna. Notably, only aligned image-text pairs are required for the training of PandaGPT. Thanks to the strong capability of ImageBind in embedding data from different modalities into the same space, PandaGPT displays emergent, i.e. zero-shot, cross-modal behaviors for data other than image and text (e.g., video, audio, depth, thermal, and IMU). We hope that PandaGPT serves as an initial step toward building AGI that can perceive and understand inputs in different modalities holistically, as we humans do. Our project page is at https://panda-gpt.github.io/.
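The zero-shot cross-modal claim can be illustrated with a toy sketch. It assumes that the bridge between the frozen encoder and the language model is a single learned linear projection, and that multiple inputs are composed by adding their embeddings; the abstract states neither detail. The names `fake_encoder`, `make_projection`, `project`, and `compose`, and the dimensions, are hypothetical stand-ins and not part of ImageBind or Vicuna.

```python
import random

JOINT_DIM = 8  # toy size of the shared embedding space (assumption)
LLM_DIM = 4    # toy size of the language model's input embeddings (assumption)


def fake_encoder(modality, seed):
    """Hypothetical stand-in for a frozen encoder that embeds any modality
    into one shared space. Returns a deterministic pseudo-random vector."""
    rng = random.Random(f"{modality}-{seed}")
    return [rng.uniform(-1.0, 1.0) for _ in range(JOINT_DIM)]


def make_projection(in_dim, out_dim, seed=0):
    """Random out_dim x in_dim matrix standing in for a projection that
    would, in practice, be trained on image-text pairs only."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Map one shared-space embedding to one LLM-space prefix vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def compose(*embeddings):
    """Combine embeddings from several modalities by element-wise addition."""
    return [sum(values) for values in zip(*embeddings)]


projection = make_projection(JOINT_DIM, LLM_DIM)
image_emb = fake_encoder("image", 1)
audio_emb = fake_encoder("audio", 2)

# The same projection is applied to every modality, because all
# embeddings are assumed to live in the same space.
image_prefix = project(projection, image_emb)
audio_prefix = project(projection, audio_emb)
mixed_prefix = project(projection, compose(image_emb, audio_emb))
```

Because the projection is linear, projecting the summed embedding gives the same vector as summing the two projected prefixes. This is one way to see why a bridge trained on images alone could still accept audio or mixed inputs under these assumptions.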

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, Deng Cai • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | GQA (test) | Accuracy: 41.6 | 188 |
| Multimodal Sentiment Analysis | MOSEI | -- | 168 |
| Multimodal Sentiment Analysis | CMU-MOSI | -- | 144 |
| Emotion Recognition | IEMOCAP | -- | 115 |
| Multimodal Sentiment Analysis | CH-SIMS (test) | F1 Score: 74.7 | 108 |
| Multimodal Emotion Recognition in Conversation | MELD | Weighted Avg F1 Score: 37.88 | 36 |
| Audio-Visual Question Answering | MUSIC-AVQA | Accuracy: 33.7 | 33 |
| Multimodal Sentiment Analysis | CH-SIMS | F1 Score: 68.38 | 32 |
| Multimodal Emotion Recognition | MER 2023 | F1 Score: 40.21 | 30 |
| Binary manipulation detection | MMFakeBench 1000 samples (val) | F1 Score: 24.6 | 28 |
Showing 10 of 54 rows
