
A Step Toward Federated Pretraining of Multimodal Large Language Models

About

The rapid evolution of Multimodal Large Language Models (MLLMs) is bottlenecked by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain inaccessible in privacy-sensitive silos. Federated Learning (FL) offers a promising solution to unlock these distributed resources, but existing research focuses predominantly on fine-tuning, leaving the foundational pre-training phase largely unexplored. In this paper, we formally introduce the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and LLM while collaboratively training the cross-modal projector. We identify two critical challenges in this setting: (i) parameter interference in aggregating local projectors; and (ii) gradient oscillations in one-pass collaborative SGD. To address these challenges, we propose Fed-CMP, a pioneering framework for federated MLLM pre-training. Fed-CMP employs Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a shared alignment basis and client-specific coefficients, then performs reliability-weighted fusion to suppress parameter interference. Furthermore, Fed-CMP introduces Orthogonality-Preserved Momentum, which applies momentum to the shared alignment basis via orthogonal projection, accumulating historical optimization directions while preserving geometric structure. We construct four federated pre-training scenarios based on public datasets, and extensive experiments validate that Fed-CMP significantly outperforms existing baselines.
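The two components named in the abstract can be illustrated with a small NumPy sketch. This is only one possible reading, not the authors' implementation. The abstract does not say how the canonical space is built, how reliability is measured, or how the orthogonal projection is computed. Here each projector is assumed to be a single linear map, the shared basis is assumed to come from a truncated SVD of the stacked client projectors, reliability scores are assumed to be given, and the momentum step is re-orthonormalized with QR. All function names are hypothetical.

```python
import numpy as np

def canonical_decompose(projectors, rank):
    """Assumed decomposition: shared orthonormal basis plus per-client coefficients."""
    # Stack client projectors side by side: shape (d_out, K * d_in).
    stacked = np.concatenate(projectors, axis=1)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    basis = U[:, :rank]                          # orthonormal columns
    coeffs = [basis.T @ W for W in projectors]   # each (rank, d_in)
    return basis, coeffs

def reliability_weighted_fusion(coeffs, reliability):
    """Fuse client coefficients with normalized (assumed, externally given) weights."""
    w = np.asarray(reliability, dtype=float)
    w = w / w.sum()
    return sum(wi * c for wi, c in zip(w, coeffs))

def momentum_on_basis(basis, prev_basis, beta):
    """Blend the current basis with the previous one, then restore orthonormality.

    QR is an assumption standing in for the paper's orthogonal projection.
    A real implementation would also need to handle sign and rotation
    ambiguity between the two bases before blending.
    """
    mixed = beta * prev_basis + (1.0 - beta) * basis
    Q, _ = np.linalg.qr(mixed)
    return Q

# Toy example: 3 clients, each with an 8 x 16 linear projector.
rng = np.random.default_rng(0)
clients = [rng.normal(size=(8, 16)) for _ in range(3)]

basis, coeffs = canonical_decompose(clients, rank=8)
fused = reliability_weighted_fusion(coeffs, reliability=[1.0, 1.0, 1.0])
global_projector = basis @ fused

new_basis = momentum_on_basis(basis, prev_basis=np.eye(8), beta=0.9)
```

With a full-rank basis and equal reliability scores, this sketch reduces to plain averaging of the client projectors. Unequal scores or a truncated rank are what would make the fusion differ from that baseline.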

Baochen Xiong, Yifan Xu, Xiaoshan Yang, Yaguang Song, Yaowei Wang, Changsheng Xu • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy: 76.2 | 2019 |
| Multimodal Understanding | MMBench | Accuracy: 35.2 | 847 |
| Multimodal Evaluation | MME | -- | 727 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score: 33.4 | 517 |
| Multimodal Perception and Cognition | MME | Overall Score: 1.14e+3 | 270 |
| Multimodal Understanding | SEED | Accuracy: 30.5 | 216 |
| Visual Perception | MMVP | Accuracy: 34.3 | 118 |
| Multimodal Understanding | LLaVA-Bench | Overall Score: 48.2 | 94 |
| Multimodal Visual Perception | MMVP | Accuracy: 36.9 | 72 |
| Visual Pattern Recognition | MMVP | Accuracy: 33 | 30 |
Showing 10 of 23 rows
