MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training

About

Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practices rely on costly manual tuning. We propose MaD-Mix, a principled and computationally efficient framework that derives multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual through inter-modal coupling variables. MaD-Mix systematically handles domains with missing modalities, allowing for the integration of language-only domains. Empirical evaluations across 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks. MaD-Mix matches human-tuned data mixtures using 22% fewer training steps in image-text instruction tuning. In complex tri-modal video-image-text scenarios, where manual tuning becomes impractical, MaD-Mix boosts average accuracy over uniform weights, with negligible mixture computation overhead (< 1 GPU-hour), enabling scalable mixture design for modern VLM pipelines.
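To make the notion of a data mixture concrete, the following minimal sketch shows how a set of per-domain scores could be normalized into mixture weights and then used to pick which domain each training example comes from. It does not reproduce MaD-Mix's alignment scores or its Fenchel-dual derivation, which are not given here. The domain names, the score values, and the helper functions `normalize` and `sample_domains` are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def normalize(scores):
    """Turn non-negative per-domain scores into mixture weights that sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {name: s / total for name, s in scores.items()}


def sample_domains(weights, num_samples, seed=0):
    """Draw the source domain of each training example according to the weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_samples)


# Hypothetical domains and scores (assumptions, not values from the paper).
scores = {"image_text": 3.0, "video_text": 1.0, "text_only": 1.0}

weights = normalize(scores)                               # score-based mixture
uniform = normalize({name: 1.0 for name in scores})       # uniform-weights baseline
counts = Counter(sample_domains(weights, 10_000))         # empirical domain counts
```

With these assumed scores, the score-based mixture draws more examples from `image_text` than the uniform baseline would, which is the kind of reweighting a mixture method is meant to produce.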

Wanyun Xie, Francesco Tonin, Volkan Cevher • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| Multimodal Understanding | MMBench | Accuracy | 42.44 | 847 |
| Science Question Answering | ScienceQA | Accuracy | 64.5 | 791 |
| Visual Question Answering | ScienceQA | Accuracy | 87.26 | 446 |
| Multimodal Understanding | MMStar | Accuracy | 35.88 | 407 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 45.56 | 363 |
| Visual Question Answering | AI2D | Accuracy | 72.15 | 317 |
| Visual Question Answering | RealworldQA | Accuracy | 57.39 | 259 |
| Mathematical Multimodal Reasoning | MathVerse | Accuracy | 18.91 | 259 |
| Massive Multi-discipline Multimodal Understanding | MMMU | Accuracy | 29.78 | 216 |
| Document Visual Question Answering | DocVQA | Accuracy | 57.51 | 203 |
Showing 10 of 19 rows
