
A Step Toward Federated Pretraining of Multimodal Large Language Models

About

The rapid evolution of Multimodal Large Language Models (MLLMs) is bottlenecked by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain inaccessible in privacy-sensitive silos. Federated Learning (FL) offers a promising solution to unlock these distributed resources, but existing research focuses predominantly on fine-tuning, leaving the foundational pre-training phase largely unexplored. In this paper, we formally introduce the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and LLM while collaboratively training the cross-modal projector. We identify two critical challenges in this setting: (i) parameter interference in aggregating local projectors; and (ii) gradient oscillations in one-pass collaborative SGD. To address these challenges, we propose Fed-CMP, a pioneering framework for federated MLLM pre-training. Fed-CMP employs Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a shared alignment basis and client-specific coefficients, then performs reliability-weighted fusion to suppress parameter interference. Furthermore, Fed-CMP introduces Orthogonality-Preserved Momentum, which applies momentum to the shared alignment basis via orthogonal projection, accumulating historical optimization directions while preserving geometric structure. We construct four federated pre-training scenarios based on public datasets, and extensive experiments validate that Fed-CMP significantly outperforms existing baselines.
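The abstract's two components can be sketched in code. The snippet below is a minimal toy illustration, not the paper's implementation: the SVD-based decomposition into a shared basis plus client coefficients, the softmax-over-negative-loss reliability weights, and the Stiefel-style tangent projection for the momentum step are all assumptions made for illustration; the paper's actual formulations may differ.

```python
import numpy as np

# Hypothetical sketch of the two Fed-CMP components described in the abstract.
# All function names, shapes, and the reliability weighting are assumptions.

rng = np.random.default_rng(0)
d_vis, d_llm, rank = 8, 6, 4  # toy projector dimensions

def decompose(W, rank):
    """Split a client projector into an orthonormal alignment basis U
    and client-specific coefficients C, via truncated SVD (assumed)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank]                   # shared alignment basis (orthonormal columns)
    C_r = s[:rank, None] * Vt[:rank]    # client-specific coefficients
    return U_r, C_r

def reliability_weights(losses):
    """Toy reliability score: softmax over negative local losses (assumed)."""
    z = -np.asarray(losses)
    e = np.exp(z - z.max())
    return e / e.sum()

# --- simulated client projectors and local validation losses ---
clients = [rng.normal(size=(d_vis, d_llm)) for _ in range(3)]
losses = [0.9, 0.5, 0.7]

# Canonical Reliability-Aware Aggregation (sketch): decompose each client
# projector, then fuse bases and coefficients with reliability weights.
w = reliability_weights(losses)
bases, coeffs = zip(*(decompose(W, rank) for W in clients))
U_fused = sum(wi * Ui for wi, Ui in zip(w, bases))
C_fused = sum(wi * Ci for wi, Ci in zip(w, coeffs))

# Re-orthonormalize the fused basis (QR) so it remains a valid alignment basis.
U_fused, _ = np.linalg.qr(U_fused)

# Orthogonality-Preserved Momentum (sketch): accumulate momentum on the shared
# basis, project it onto the tangent space of the orthonormal-basis manifold so
# the update does not break orthogonality, then retract back via QR.
M = np.zeros_like(U_fused)             # momentum buffer
G = rng.normal(size=U_fused.shape)     # stand-in for a gradient on the basis
beta, lr = 0.9, 0.1
M = beta * M + G
sym = (U_fused.T @ M + M.T @ U_fused) / 2
M_proj = M - U_fused @ sym             # tangent-space component only
U_new, _ = np.linalg.qr(U_fused - lr * M_proj)
```

After the step, `U_new.T @ U_new` is still the identity, i.e. the geometric structure of the alignment basis is preserved while historical directions accumulate in `M`.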

Baochen Xiong, Yifan Xu, Xiaoshan Yang, Yaguang Song, Yaowei Wang, Changsheng Xu• 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy 76.2 | 1455 |
| Multimodal Evaluation | MME | -- | 658 |
| Multimodal Understanding | MMBench | Accuracy 35.2 | 637 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score 33.4 | 431 |
| Multimodal Understanding | SEED | Accuracy 30.5 | 183 |
| Multimodal Perception and Cognition | MME | Overall Score 1140 | 182 |
| Visual Perception | MMVP | Accuracy 34.3 | 82 |
| Multimodal Visual Perception | MMVP | Accuracy 36.9 | 72 |
| Multimodal Understanding | LLaVA-Bench | Overall Score 48.2 | 72 |
| Multimodal Visual Pattern Understanding | MMVP | Accuracy 34.7 | 25 |

Showing 10 of 23 rows
