DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

About

Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We also show that DiM3 can be applied directly to already trained multilingual multimodal models and still yields additional gains. Interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. The code is available at https://github.com/wzj1718/DiM3.
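The abstract does not spell out the exact composition rule, so the following is only a minimal sketch of what a direction- and magnitude-aware, per-dimension merge of two residual updates could look like. It is not the authors' method. The function name `merge_updates`, the `ratio` threshold, and the specific conflict rule are all hypothetical, chosen for illustration. Weights are flat lists of floats standing in for language-backbone parameters; the vision encoder and multimodal projector are assumed to be left untouched.

```python
from typing import List


def merge_updates(
    base: List[float],
    multilingual: List[float],
    multimodal: List[float],
    ratio: float = 2.0,  # hypothetical magnitude threshold, not from the paper
) -> List[float]:
    """Compose two residual updates elementwise (illustrative rule only)."""
    merged = []
    for w0, w_ml, w_mm in zip(base, multilingual, multimodal):
        d_ml = w_ml - w0  # multilingual residual update
        d_mm = w_mm - w0  # multimodal residual update
        if d_ml * d_mm >= 0:
            # Directions agree (or one update is zero): add both updates.
            delta = d_ml + d_mm
        elif abs(d_ml) > ratio * abs(d_mm):
            # Directions conflict and the multilingual update clearly
            # dominates in magnitude: take the multilingual update.
            delta = d_ml
        else:
            # Directions conflict otherwise: keep the multimodal update
            # so the existing multimodal behaviour is disturbed less.
            delta = d_mm
        merged.append(w0 + delta)
    return merged


if __name__ == "__main__":
    base = [0.0, 0.0, 0.0]
    multilingual = [1.0, 3.0, -0.5]
    multimodal = [0.5, -1.0, 1.0]
    print(merge_updates(base, multilingual, multimodal))
```

In this toy rule the sign of each update stands in for "direction" and its absolute value for "magnitude"; the actual criteria used by DiM3 should be taken from the paper and repository.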

Zijing Wang, Mingyang Wang, Ercong Nie, Yongkang Liu, Shi Feng, Mengjie Zhao, Daling Wang, Xiaocui Yang, Hinrich Schütze • 2026

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	MMStar	Accuracy 34.86	511
Multimodal Understanding	SEEDBench2 Plus	Accuracy 40.45	138
Multimodal Understanding	MMMU	Accuracy 37.22	34
Multilingual Multimodal Multiple-Choice Question Answering	Afri-MCQA	Average Accuracy 45.2	15
Visual Question Answering	CVQA	--	14
Multilingual Visual Question Answering	MaXM	Avg. Score (MaXM) 33.44	11
Multimodal Understanding	XMMMU	Avg_mul 33.85	11
Multicultural Visual Reasoning	MaRVL	Avg_mul Score 62.91	10
Visual Question Answering	xGQA	Avg_mul Score 48.04	10

