
Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models

About

Unified Multimodal Generative Models (UMGMs) unify visual understanding and image generation within a single autoregressive framework. However, their ability to continually learn new tasks is severely hindered by catastrophic forgetting, both within a modality (intra-modal) and across modalities (inter-modal). While intra-modal forgetting has been studied in prior continual learning (CL) work, inter-modal forgetting remains largely unexplored. In this paper, we identify and empirically validate this phenomenon in UMGMs and provide a theoretical explanation rooted in gradient conflict between modalities. To address both intra- and inter-modal forgetting, we propose Modality-Decoupled Experts (MoDE), a lightweight and scalable architecture that isolates modality-specific updates to mitigate the gradient conflict and leverages knowledge distillation to prevent catastrophic forgetting and preserve pre-trained capabilities. Unlike previous CL methods that remain modality-coupled and suffer from modality gradient conflict, MoDE explicitly decouples modalities to prevent interference. Experiments across diverse benchmarks demonstrate that MoDE significantly mitigates both inter- and intra-modal forgetting, outperforming prior CL baselines in unified multimodal generation settings. Code will be made publicly available at: https://github.com/Christina200/MoDE-official.git
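The abstract describes MoDE only at a high level. As a rough illustration of the idea, here is a minimal PyTorch sketch of modality-decoupled experts wrapped around a frozen pre-trained layer, together with a standard knowledge-distillation loss. The class and function names (`ModalityDecoupledExperts`, `distillation_loss`), the low-rank adapter form, and the token-level routing rule are all illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of the MoDE idea from the abstract: per-modality experts
# isolate gradient updates, and distillation preserves pre-trained skills.
# Names, adapter structure, and routing are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityDecoupledExperts(nn.Module):
    """Wraps a frozen pre-trained linear layer with one lightweight low-rank
    expert per modality, so updates from understanding tokens and image-
    generation tokens never flow into shared trainable parameters."""

    def __init__(self, base_layer: nn.Linear, num_modalities: int = 2, rank: int = 8):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights stay fixed
        d_in, d_out = base_layer.in_features, base_layer.out_features
        # One low-rank expert per modality, e.g. 0 = text/understanding,
        # 1 = image/generation (assumed tagging scheme).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_in, rank, bias=False),
                nn.Linear(rank, d_out, bias=False),
            )
            for _ in range(num_modalities)
        )

    def forward(self, x: torch.Tensor, modality_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); modality_id: (batch, seq) integer tag per token
        out = self.base(x)
        for m, expert in enumerate(self.experts):
            mask = (modality_id == m).unsqueeze(-1).to(x.dtype)
            out = out + mask * expert(x)  # each token only updates its own expert
        return out


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL distillation from the frozen pre-trained model (teacher) to the
    adapted model (student), discouraging drift from pre-trained behavior."""
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature * temperature)
```

Under this reading, the decoupling addresses the paper's gradient-conflict argument by construction: each token's gradient reaches only its own modality's expert, so the two modalities never pull a shared trainable weight in opposing directions, while the frozen base plus the distillation term guards against forgetting of pre-trained capabilities.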

Xiwen Wei, Mustafa Munir, Radu Marculescu • 2025

Related benchmarks

| Task | Dataset | Accuracy | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | VizWiz | 61.02 | 1043 |
| Visual Question Answering | GQA | 37.03 | 963 |
| Text-based Visual Question Answering | TextVQA | 46.34 | 496 |
| Visual Question Answering | GQA | 62.01 | 374 |
| Science Question Answering | ScienceQA | 70.45 | 229 |
| Image Classification | ImageNet | 80.67 | 47 |
| Multimodal Question Answering | ScienceQA | 81.01 | 35 |
| Continual Multimodal Instruction Tuning | CoIN (ScienceQA, TextVQA, ImageNet, GQA, VizWiz, Grounding; Chameleon backbone) | 53.02 | 22 |
| Referring Expression Grounding | RefCOCO, RefCOCO+, RefCOCOg | 44.99 | 10 |
| Multimodal Understanding | ScienceQA, TextVQA, GQA, VizWiz, ImageNet | 33.47 | 7 |
Showing 10 of 11 rows

Other info

GitHub: https://github.com/Christina200/MoDE-official.git