LLMs Can Evolve Continually on Modality for X-Modal Reasoning

About

Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities that enables MLLMs to continually EVolve on modalities for $\mathbb{X}$-modal reasoning. We leverage the concept of Continual Learning and develop an incremental training strategy atop pre-trained MLLMs, enabling their expansion to new modalities using uni-modal data, without executing joint-modal pretraining. In detail, a novel Adapter-in-Adapter (AnA) framework is introduced, in which uni-modal and cross-modal adapters are seamlessly integrated to facilitate efficient modality alignment and collaboration. Additionally, an MoE-based gating module is applied between the two types of adapters to further enhance the multimodal interaction. To investigate the proposed method, we establish a challenging benchmark called Continual Learning of Modality (MCL), which consists of high-quality QA data from five distinct modalities: image, video, audio, depth and point cloud. Extensive experiments demonstrate the effectiveness of the proposed AnA framework on learning plasticity and memory stability during continual learning. Furthermore, PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%. Our code is available at https://github.com/JiazuoYu/PathWeave
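
To make the idea of gating between adapters more concrete, here is a minimal, generic sketch of a softmax gate that mixes the outputs of several bottleneck adapters and adds the result back to the input. It is not the authors' implementation and does not reproduce the AnA design: the function names, the dimensions `d` and `r`, the random weights, and the choice of two fixed adapters plus one new adapter are all assumptions made purely for illustration.

```python
import math
import random

def linear(x, w):
    """Multiply vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def make_adapter(d, r, rng):
    """Hypothetical bottleneck adapter: down-projection d -> r, then up-projection r -> d."""
    down = [[rng.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
    up = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]
    return down, up

def adapter_forward(x, adapter):
    """Apply down-projection, ReLU, and up-projection."""
    down, up = adapter
    h = [max(0.0, v) for v in linear(x, down)]
    return linear(h, up)

def gated_forward(x, fixed_adapters, new_adapter, gate_w):
    """Return x + sum_k g_k * adapter_k(x), where g = softmax(x @ gate_w)."""
    experts = fixed_adapters + [new_adapter]
    gates = softmax(linear(x, gate_w))
    outs = [adapter_forward(x, a) for a in experts]
    mixed = [sum(g * o[i] for g, o in zip(gates, outs)) for i in range(len(x))]
    return [xi + mi for xi, mi in zip(x, mixed)], gates

rng = random.Random(0)
d, r = 8, 2  # assumed feature and bottleneck sizes
fixed = [make_adapter(d, r, rng) for _ in range(2)]  # stand-ins for adapters from earlier modalities
new = make_adapter(d, r, rng)                        # stand-in for the adapter of a new modality
gate_w = [[rng.gauss(0, 0.1) for _ in range(3)] for _ in range(d)]  # one gate logit per adapter
x = [rng.gauss(0, 1) for _ in range(d)]
y, gates = gated_forward(x, fixed, new, gate_w)
```

In a continual-learning setting of the kind the abstract describes, only the new adapter and the gate would typically be updated for a new modality while earlier adapters stay fixed; the sketch shows the forward pass only and leaves training out.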

Jiazuo Yu, Haomiao Xiong, Lu Zhang, Haiwen Diao, Yunzhi Zhuge, Lanqing Hong, Dong Wang, Huchuan Lu, You He, Long Chen • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | GQA | Accuracy 47.8 | 963 |
| Video Question Answering | MSRVTT-QA | Accuracy 37.4 | 481 |
| Audio Classification | ESC-50 | Accuracy 72.6 | 325 |
| Video Question Answering | MSVD-QA (test) | Accuracy 48.2 | 274 |
| Audio Captioning | AudioCaps (test) | CIDEr 59.4 | 140 |
| Video Question Answering | MSVD | Accuracy 48.2 | 100 |
| Video Captioning | MSRVTT | CIDEr 52.8 | 61 |
| Image Captioning | COCO (test) | CIDEr 138.7 | 43 |
| Image Classification | SUN | Accuracy 42.2 | 27 |
| Audio Question and Answering | ClothoAQA | Accuracy 33.5 | 20 |

Showing 10 of 23 rows
