PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
About
Multimodal Large Language Models (MLLMs) rely on multimodal pre-training over diverse data sources, where different datasets often induce complementary cross-modal alignment capabilities. Model merging provides a cost-effective mechanism for integrating multiple expert MLLMs with complementary strengths into a unified model. However, existing model merging research mainly focuses on post-finetuning scenarios, leaving the pre-training stage largely unexplored. We argue that the core of MLLM pre-training lies in establishing effective cross-modal alignment, which bridges visual and textual representations into a unified semantic space. Motivated by this insight, we introduce the post-alignment merging task, which aims to integrate cross-modal alignment capabilities learned from heterogeneous multimodal pre-training. This setting introduces two key challenges: cross-domain parameter interference, where parameter updates learned from different data distributions conflict during merging, and layer-wise alignment contribution disparity, where different layers and projectors contribute unevenly to cross-modal alignment. To address them, we propose PivotMerge, a post-alignment merging framework for cross-modal projectors. PivotMerge incorporates two key components: Shared-space Decomposition and Filtering, which disentangles shared alignment patterns from domain-specific variations and suppresses conflicting directions, and Alignment-guided Layer-wise Merging, which assigns layer-specific merging weights based on differing alignment contributions. We construct systematic CC12M-based post-alignment merging scenarios for evaluation. Extensive experiments on multiple multimodal benchmarks show that PivotMerge consistently outperforms existing baselines, demonstrating its effectiveness and generalization ability.
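The abstract does not give the exact formulation, so the following is only a rough illustration of the general idea, not PivotMerge itself. It assumes each expert projector differs from a common base by a "task vector", uses an SVD of the stacked task vectors as a stand-in for the shared space, drops coefficients whose sign conflicts across experts, and scales each layer's merged update by a given weight. The function names (`shared_basis`, `filter_conflicts`, `merge_layer`, `merge_projector`), the sign-agreement filter, the `rank` parameter, and the hand-set layer weights are all assumptions made for this sketch.

```python
import numpy as np

def shared_basis(deltas, rank):
    """Top-`rank` left singular vectors of the column-stacked task vectors.
    Hypothetical stand-in for a shared alignment subspace."""
    stacked = np.concatenate(deltas, axis=1)            # (out, n_experts * in)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :rank]

def filter_conflicts(coeffs):
    """Zero out coefficients whose sign disagrees with the sign of the
    summed coefficients across experts (an assumed filtering rule)."""
    stacked = np.stack(coeffs)                          # (n_experts, rank, in)
    majority = np.sign(stacked.sum(axis=0))
    keep = np.sign(stacked) == majority
    return stacked * keep

def merge_layer(base, experts, rank, weight):
    """Merge one projector layer: project task vectors onto the shared basis,
    filter conflicts, average the survivors, and scale by `weight`."""
    deltas = [w - base for w in experts]
    u = shared_basis(deltas, rank)
    coeffs = [u.T @ d for d in deltas]
    filtered = filter_conflicts(coeffs)
    counts = np.maximum((filtered != 0).sum(axis=0), 1) # avoid division by zero
    merged_coeff = filtered.sum(axis=0) / counts
    return base + weight * (u @ merged_coeff)

def merge_projector(base, experts, layer_weights, rank):
    """Apply merge_layer to every named layer with its own merging weight."""
    return {
        name: merge_layer(base[name], [e[name] for e in experts],
                          rank, layer_weights[name])
        for name in base
    }

# Toy usage with random matrices standing in for projector weights.
rng = np.random.default_rng(0)
base = {"proj.0": rng.normal(size=(4, 3)), "proj.1": rng.normal(size=(4, 4))}
experts = [
    {k: v + 0.1 * rng.normal(size=v.shape) for k, v in base.items()}
    for _ in range(2)
]
# Layer weights are set by hand here; the paper derives them from
# alignment contributions, which this sketch does not model.
merged = merge_projector(base, experts, {"proj.0": 0.8, "proj.1": 0.5}, rank=2)
print({k: v.shape for k, v in merged.items()})
```

In this toy version, a layer weight of 0 returns the base layer unchanged, and two identical experts merged at full rank with weight 1 reproduce that expert, which is a useful sanity check for any merging rule of this shape.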
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE (test) | -- | -- | 107 |
| Multi-modal Reasoning | MMVet (test) | Accuracy | 27.8 | 49 |
| Multi-modal Question Answering | MMStar (test) | Accuracy | 27.5 | 17 |
| Multimodal Perception | MME-P (test) | MME-P Score | 1.11e+3 | 13 |
| Multimodal QA | MMBench EN (test) | MMBenchEN Score | 32 | 13 |
| Multimodal QA | SEEDBench (test) | SEEDBench Score | 33.7 | 13 |
| Multimodal QA | LLaVABench (test) | LLaVABench Score | 48 | 13 |
| Multimodal Understanding and Reasoning | Multimodal Evaluation Suite (MMVet, MMBench_EN, SEED-Bench, LLaVABench, POPE, MME-P, MMVP, MMStar) (Random Sampling Splits of CC12M) | MMVet Score | 30.1 | 13 |
| Visual Perception | MMVP (test) | MMVP Score | 34.3 | 13 |