
MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training

About

Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practices rely on costly manual tuning. We propose MaD-Mix, a principled and computationally efficient framework that derives multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual through inter-modal coupling variables. MaD-Mix systematically handles domains with missing modalities, allowing for the integration of language-only domains. Empirical evaluations across 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks. MaD-Mix matches human-tuned data mixtures using 22% fewer training steps in image-text instruction tuning. In complex tri-modal video-image-text scenarios, where manual tuning becomes impractical, MaD-Mix boosts average accuracy over uniform weights, with negligible mixture computation overhead (< 1 GPU-hour), enabling scalable mixture design for modern VLM pipelines.
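To make the mixture-design step concrete, here is a minimal sketch of turning per-domain alignment scores into sampling weights. The scores, domain names, and the softmax mapping below are illustrative assumptions — in MaD-Mix the scores come from a closed-form Fenchel-dual expression with inter-modal coupling variables, and the paper's exact score-to-weight mapping may differ.

```python
import math

def mixture_weights(alignment_scores, temperature=1.0):
    """Normalize per-domain alignment scores into a training mixture.

    `alignment_scores` maps domain name -> scalar alignment score.
    A temperature softmax is one simple way to turn scores into
    sampling weights; it is used here only for illustration.
    """
    exps = {d: math.exp(s / temperature) for d, s in alignment_scores.items()}
    total = sum(exps.values())
    return {d: v / total for d, v in exps.items()}

# Hypothetical scores for three domains, one of them language-only.
scores = {"image-text": 0.9, "video-text": 0.4, "text-only": 0.2}
weights = mixture_weights(scores)
assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights form a distribution
```

Because the scores are closed-form, this normalization step is cheap — consistent with the paper's claim of under 1 GPU-hour of mixture-computation overhead.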

Wanyun Xie, Francesco Tonin, Volkan Cevher • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMBench | Accuracy | 42.44 | 637 |
| Science Question Answering | ScienceQA | Accuracy | 64.5 | 502 |
| Visual Question Answering | ScienceQA | Accuracy | 87.26 | 370 |
| Multimodal Understanding | MMStar | Accuracy | 35.88 | 324 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 45.56 | 317 |
| Visual Question Answering | AI2D | Accuracy | 72.15 | 249 |
| Mathematical Multimodal Reasoning | MathVerse | Accuracy | 18.91 | 221 |
| Visual Question Answering | RealworldQA | Accuracy | 57.39 | 179 |
| Massive Multi-discipline Multimodal Understanding | MMMU | Accuracy | 29.78 | 152 |
| Real-world Visual Question Answering | RealworldQA | Accuracy | 46.54 | 140 |

Showing 10 of 19 rows.
