Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models
About
Unified Multimodal Generative Models (UMGMs) unify visual understanding and image generation within a single autoregressive framework. However, their ability to continually learn new tasks is severely hindered by catastrophic forgetting, both within a modality (intra-modal) and across modalities (inter-modal). While intra-modal forgetting has been studied in prior continual learning (CL) work, inter-modal forgetting remains largely unexplored. In this paper, we identify and empirically validate this phenomenon in UMGMs and provide a theoretical explanation rooted in gradient conflict between modalities. To address both kinds of forgetting, we propose Modality-Decoupled Experts (MoDE), a lightweight and scalable architecture that isolates modality-specific updates to mitigate the gradient conflict and leverages knowledge distillation to preserve pre-trained capabilities. Unlike previous CL methods, which remain modality-coupled and therefore suffer from cross-modal gradient conflict, MoDE explicitly decouples the modalities to prevent interference. Experiments across diverse benchmarks demonstrate that MoDE significantly mitigates both intra- and inter-modal forgetting, outperforming prior CL baselines in unified multimodal generation settings. Code will be publicly available at https://github.com/Christina200/MoDE-official.git
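For readers who want a concrete picture of the decoupling idea, below is a minimal PyTorch sketch: a frozen backbone layer is paired with one lightweight expert per modality, so understanding and generation batches never update shared trainable weights, and a distillation term anchors outputs to the frozen pre-trained model. The module names (`ModalityExpert`, `MoDELayer`), adapter shape, hard routing, and temperature are illustrative assumptions, not the released implementation; see the repository linked above for the official code.

```python
# Illustrative sketch of the Modality-Decoupled Experts (MoDE) idea.
# All names, sizes, and routing/loss details here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityExpert(nn.Module):
    """Lightweight adapter-style expert for a single modality (assumed form)."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(F.gelu(self.down(x)))

class MoDELayer(nn.Module):
    """Wraps a frozen backbone layer with per-modality experts so that
    understanding and generation updates never share trainable weights,
    avoiding the cross-modal gradient conflict described above."""
    def __init__(self, backbone_layer: nn.Module, dim: int):
        super().__init__()
        self.backbone = backbone_layer
        for p in self.backbone.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen
        self.experts = nn.ModuleDict({
            "und": ModalityExpert(dim),  # visual-understanding expert
            "gen": ModalityExpert(dim),  # image-generation expert
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        h = self.backbone(x)
        # Hard routing by task modality: only one expert receives
        # gradients for a given batch, so updates stay decoupled.
        return h + self.experts[modality](h)

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KD term against the frozen pre-trained model's outputs, used to
    preserve pre-trained capabilities (temperature T is an assumption)."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

# Usage: route an understanding batch through its own expert.
layer = MoDELayer(nn.Linear(512, 512), dim=512)
x = torch.randn(4, 512)
y_und = layer(x, modality="und")
```

In this sketch, routing is hard and keyed on the known task modality rather than a learned gate; that choice is what keeps the per-modality gradients fully separated in the trainable parameters.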
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy: 61.02 | 1043 |
| Visual Question Answering | GQA | Accuracy: 37.03 | 963 |
| Text-based Visual Question Answering | TextVQA | Accuracy: 46.34 | 496 |
| Visual Question Answering | GQA | Accuracy: 62.01 | 374 |
| Science Question Answering | ScienceQA | Accuracy: 70.45 | 229 |
| Image Classification | ImageNet | Accuracy: 80.67 | 47 |
| Multimodal Question Answering | ScienceQA | Accuracy: 81.01 | 35 |
| Continual Multimodal Instruction Tuning | CoIN (ScienceQA, TextVQA, ImageNet, GQA, VizWiz, Grounding; Chameleon backbone) | Accuracy: 53.02 | 22 |
| Referring Expression Grounding | RefCOCO, RefCOCO+, RefCOCOg | Accuracy: 44.99 | 10 |
| Multimodal Understanding | ScienceQA, TextVQA, GQA, VizWiz, ImageNet | Accuracy: 33.47 | 7 |