LLMs Can Evolve Continually on Modality for X-Modal Reasoning
About
Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, which imposes a significant computational burden when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities that enables MLLMs to continually EVolve on modalities for $\mathbb{X}$-modal reasoning. We leverage the concept of Continual Learning and develop an incremental training strategy atop pre-trained MLLMs, enabling their expansion to new modalities using uni-modal data, without joint-modal pretraining. Specifically, we introduce a novel Adapter-in-Adapter (AnA) framework, in which uni-modal and cross-modal adapters are seamlessly integrated to facilitate efficient modality alignment and collaboration. Additionally, an MoE-based gating module is applied between the two types of adapters to further enhance multimodal interaction. To evaluate the proposed method, we establish a challenging benchmark called Continual Learning of Modality (MCL), which consists of high-quality QA data from five distinct modalities: image, video, audio, depth and point cloud. Extensive experiments demonstrate the effectiveness of the proposed AnA framework in terms of learning plasticity and memory stability during continual learning. Furthermore, PathWeave performs comparably to state-of-the-art MLLMs while reducing the parameter training burden by 98.73%. Our code is available at https://github.com/JiazuoYu/PathWeave
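To make the path switching and expansion idea more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The names `Adapter`, `ModalityPaths`, `expand` and `forward` are hypothetical, the weights are random rather than trained, and the cross-modal adapters are simplified to plain reuse of earlier adapters. It only illustrates two points from the abstract: a new modality adds parameters without touching existing ones, and a softmax gate mixes the adapters available on the selected modality path.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rows, cols, rng):
    """Small random matrix; a stand-in for trained weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class Adapter:
    """Bottleneck adapter: down-project, ReLU, up-project."""

    def __init__(self, dim, bottleneck, rng):
        self.down = rand_matrix(bottleneck, dim, rng)
        self.up = rand_matrix(dim, bottleneck, rng)

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]
        return matvec(self.up, hidden)


class ModalityPaths:
    """Hypothetical container: one adapter and one gate per learned modality."""

    def __init__(self, dim, bottleneck, seed=0):
        self.dim, self.bottleneck = dim, bottleneck
        self.rng = random.Random(seed)
        self.order = []      # modalities in the order they were added
        self.adapters = {}   # modality -> its own adapter
        self.gates = {}      # modality -> gate matrix over its experts

    def expand(self, modality):
        """Add a path for a new modality; existing parameters stay untouched."""
        self.order.append(modality)
        self.adapters[modality] = Adapter(self.dim, self.bottleneck, self.rng)
        # One gate logit per expert: every earlier adapter plus the new one.
        self.gates[modality] = rand_matrix(len(self.order), self.dim, self.rng)

    def forward(self, x, modality):
        """Switch to the path of `modality` and mix its experts with the gate."""
        n = self.order.index(modality) + 1
        experts = [self.adapters[m] for m in self.order[:n]]
        weights = softmax(matvec(self.gates[modality], x))
        out = list(x)  # residual connection around the adapters
        for w, expert in zip(weights, experts):
            out = [o + w * e for o, e in zip(out, expert(x))]
        return out


paths = ModalityPaths(dim=4, bottleneck=2)
paths.expand("image")
x = [0.5, -1.0, 0.25, 2.0]
before = paths.forward(x, "image")

paths.expand("video")                   # expansion adds a second path
after = paths.forward(x, "image")       # switching back to the image path
video_out = paths.forward(x, "video")   # video path mixes both adapters
```

In this sketch the image path produces the same output before and after the video path is added, because expansion only appends parameters. That is the property the abstract refers to as memory stability. Training is omitted; in a real setup, only the parameters added for the new modality would be updated.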
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | GQA | Accuracy | 47.8 | 963 |
| Video Question Answering | MSRVTT-QA | Accuracy | 37.4 | 481 |
| Audio Classification | ESC-50 | Accuracy | 72.6 | 325 |
| Video Question Answering | MSVD-QA (test) | Accuracy | 48.2 | 274 |
| Audio Captioning | AudioCaps (test) | CIDEr | 59.4 | 140 |
| Video Question Answering | MSVD | Accuracy | 48.2 | 100 |
| Video Captioning | MSRVTT | CIDEr | 52.8 | 61 |
| Image Captioning | COCO (test) | CIDEr | 138.7 | 43 |
| Image Classification | SUN | Accuracy | 42.2 | 27 |
| Audio Question and Answering | ClothoAQA | Accuracy | 33.5 | 20 |