DiM3: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

About

Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our code is available at https://github.com/wzj1718/DiM3.
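The abstract describes composing two residual updates per parameter dimension but does not state the selection rule. As a rough illustration only, the sketch below merges a multimodal and a multilingual update with a simple sign-and-magnitude heuristic. The function name `merge_backbone_weights` and the rule itself (keep the multimodal update on sign conflicts, otherwise keep the larger-magnitude update) are assumptions made for this example and should not be read as the DiM3 procedure; the linked repository has the actual method.

```python
import numpy as np

def merge_backbone_weights(base, multimodal, multilingual):
    """Illustrative per-dimension merge of two residual updates.

    NOTE: hypothetical rule for illustration, not the DiM3 rule.
    All three arrays hold the same backbone parameter from the shared
    base model, the multimodal model, and the multilingual model.
    """
    # Residual updates relative to the shared base weights.
    d_mm = multimodal - base
    d_ml = multilingual - base

    # Direction: the two updates conflict where their signs are opposite.
    conflict = (d_mm * d_ml) < 0

    # Magnitude: without a conflict, keep whichever update is larger.
    larger = np.where(np.abs(d_ml) > np.abs(d_mm), d_ml, d_mm)

    # With a conflict, fall back to the multimodal update (assumption:
    # the multimodal model is the one being extended).
    merged_update = np.where(conflict, d_mm, larger)
    return base + merged_update
```

In a full model this kind of function would be applied only to the language model backbone parameters, leaving the vision encoder and multimodal projector untouched, as the abstract specifies.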

Zijing Wang, Mingyang Wang, Ercong Nie, Yongkang Liu, Shi Feng, Mengjie Zhao, Daling Wang, Xiaocui Yang, Hinrich Schütze • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multimodal Understanding | MMStar | Accuracy 34.86 | 407 |
| Multimodal Understanding | SEEDBench2 Plus | Accuracy 40.45 | 138 |
| Multimodal Understanding | MMMU | Accuracy 37.22 | 34 |
| Multilingual Multimodal Multiple-Choice Question Answering | Afri-MCQA | Average Accuracy 45.2 | 15 |
| Visual Question Answering | CVQA | -- | 14 |
| Multilingual Visual Question Answering | MaXM | Avg. Score (MaXM) 33.44 | 11 |
| Multimodal Understanding | XMMMU | Avg_mul 33.85 | 11 |
| Multicultural Visual Reasoning | MaRVL | Avg_mul Score 62.91 | 10 |
| Visual Question Answering | xGQA | Avg_mul Score 48.04 | 10 |
