
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

About

Building general-purpose models that can perceive diverse real-world modalities and solve various tasks is an appealing target in artificial intelligence. In this paper, we present ChatBridge, a novel multimodal language model that leverages the expressive capabilities of language as the catalyst to bridge the gap between various modalities. We show that language-paired two-modality data alone is sufficient to connect all modalities. ChatBridge leverages recent large language models (LLMs) and extends their zero-shot capabilities to incorporate diverse multimodal inputs. ChatBridge is trained in two stages. The first stage aligns each modality with language, which brings emergent multimodal correlation and collaboration abilities. The second stage instruction-finetunes ChatBridge to align it with user intent using our newly proposed multimodal instruction tuning dataset, named MULTIS, which covers 16 multimodal tasks across text, image, video, and audio modalities. We show strong quantitative and qualitative results on zero-shot multimodal tasks covering text, image, video, and audio modalities. All code, data, and models of ChatBridge will be open-sourced.

Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, Jing Liu • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | GQA | Accuracy | 47.3 | 963 |
| Audio Classification | ESC-50 | Accuracy | 22.8 | 325 |
| Visual Question Answering | OKVQA | Top-1 Accuracy | 45.2 | 283 |
| Video Question Answering | MSVD-QA (test) | Accuracy | 45.3 | 274 |
| Audio Captioning | AudioCaps (test) | CIDEr | 26.2 | 140 |
| Image Captioning | Flickr30K | CIDEr Score | 82.5 | 111 |
| Image Captioning | NoCaps | CIDEr | 115.7 | 101 |
| Video Question Answering | MSVD | Accuracy | 46.2 | 100 |
| Audio Captioning | Clotho | CIDEr | 26.2 | 60 |
| Audio-Visual Question Answering | MUSIC-AVQA (test) | Acc (Avg) | 43 | 59 |
Showing 10 of 27 rows
