A Step Toward Federated Pretraining of Multimodal Large Language Models
About
The rapid evolution of Multimodal Large Language Models (MLLMs) is bottlenecked by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain inaccessible in privacy-sensitive silos. Federated Learning (FL) offers a promising solution to unlock these distributed resources, but existing research focuses predominantly on fine-tuning, leaving the foundational pre-training phase largely unexplored. In this paper, we formally introduce the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and LLM while collaboratively training the cross-modal projector. We identify two critical challenges in this setting: (i) parameter interference in aggregating local projectors; and (ii) gradient oscillations in one-pass collaborative SGD. To address these challenges, we propose Fed-CMP, a pioneering framework for federated MLLM pre-training. Fed-CMP employs Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a shared alignment basis and client-specific coefficients, then performs reliability-weighted fusion to suppress parameter interference. Furthermore, Fed-CMP introduces Orthogonality-Preserved Momentum, which applies momentum to the shared alignment basis via orthogonal projection, accumulating historical optimization directions while preserving geometric structure. We construct four federated pre-training scenarios based on public datasets, and extensive experiments validate that Fed-CMP significantly outperforms existing baselines.
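The two mechanisms named above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the truncated-SVD construction of the shared basis, the reliability scores, all dimensions, and the QR re-orthonormalization step are assumptions chosen to make the idea concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 clients, each holding a local cross-modal
# projector W_k (d_vis -> d_llm) and a reliability score r_k
# (e.g., from local alignment quality; assumed given here).
d_vis, d_llm, rank = 8, 6, 4
clients = [rng.normal(size=(d_llm, d_vis)) for _ in range(3)]
reliability = np.array([0.5, 0.3, 0.2])  # assumed normalized

# 1) Canonical space: one plausible way to get a shared alignment
#    basis U is a truncated SVD of the stacked client projectors.
stacked = np.concatenate(clients, axis=1)        # (d_llm, 3*d_vis)
U, _, _ = np.linalg.svd(stacked, full_matrices=False)
U = U[:, :rank]                                  # shared basis, orthonormal columns

# 2) Decompose each W_k into client-specific coefficients in that basis.
coeffs = [U.T @ W for W in clients]              # each (rank, d_vis)

# 3) Reliability-weighted fusion of coefficients, then reconstruction
#    of the global projector, down-weighting unreliable clients.
fused = sum(r * C for r, C in zip(reliability, coeffs))
W_global = U @ fused                             # (d_llm, d_vis)

# 4) Orthogonality-preserved momentum on the basis: accumulate the
#    update direction, step, then re-orthonormalize (here via QR)
#    so the basis keeps its orthonormal-column geometry.
momentum, beta, lr = np.zeros_like(U), 0.9, 0.1
grad_U = rng.normal(size=U.shape)                # stand-in for a real update
momentum = beta * momentum + grad_U
U_new, _ = np.linalg.qr(U - lr * momentum)

assert np.allclose(U_new.T @ U_new, np.eye(rank), atol=1e-8)
print(W_global.shape)
```

The QR step is one common way to project a momentum update back onto the set of orthonormal bases; the paper's orthogonal projection may differ in detail.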
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 76.2 | 1455 |
| Multimodal Evaluation | MME | -- | -- | 658 |
| Multimodal Understanding | MMBench | Accuracy | 35.2 | 637 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 33.4 | 431 |
| Multimodal Understanding | SEED | Accuracy | 30.5 | 183 |
| Multimodal Perception and Cognition | MME | Overall Score | 1140 | 182 |
| Visual Perception | MMVP | Accuracy | 34.3 | 82 |
| Multimodal Visual Perception | MMVP | Accuracy | 36.9 | 72 |
| Multimodal Understanding | LLaVA-Bench | Overall Score | 48.2 | 72 |
| Multimodal Visual Pattern Understanding | MMVP | Accuracy | 34.7 | 25 |