
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

About

Building general-purpose models that can perceive diverse real-world modalities and solve various tasks is an appealing target in artificial intelligence. In this paper, we present ChatBridge, a novel multimodal language model that leverages the expressive capabilities of language as the catalyst to bridge the gap between various modalities. We show that language-paired two-modality data alone is sufficient to connect all modalities. ChatBridge leverages recent large language models (LLMs) and extends their zero-shot capabilities to incorporate diverse multimodal inputs. ChatBridge undergoes two-stage training. The first stage aligns each modality with language, which brings emergent multimodal correlation and collaboration abilities. The second stage instruction-finetunes ChatBridge to align it with user intent using our newly proposed multimodal instruction tuning dataset, MULTIS, which covers 16 multimodal tasks across the text, image, video, and audio modalities. We show strong quantitative and qualitative results on zero-shot multimodal tasks covering text, image, video, and audio modalities. All code, data, and models of ChatBridge will be open-sourced.

Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, Jing Liu • 2023

Related benchmarks

Task                        Dataset            Metric           Result   Rank
Visual Question Answering   GQA                Accuracy         47.3     1249
Audio Classification        ESC-50             Accuracy         22.8     374
Visual Question Answering   OKVQA              Top-1 Accuracy   45.2     283
Video Question Answering    MSVD-QA (test)     Accuracy         45.3     279
Video Question Answering    MSVD               Accuracy         46.2     152
Audio Captioning            AudioCaps (test)   CIDEr            26.2     140
Image Captioning            Flickr30K          CIDEr Score      82.5     111
Image Captioning            NoCaps             CIDEr            115.7    101
Video Captioning            VATEX              CIDEr            48.9     76
Audio Captioning            Clotho             CIDEr            26.2     60
Showing 10 of 28 rows
