
MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training

About

Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practice relies on costly manual tuning of the mixture weights. We propose MaD-Mix, a principled and computationally efficient framework that derives multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual through inter-modal coupling variables. MaD-Mix systematically handles domains with missing modalities, allowing for the integration of language-only domains. Empirical evaluations on 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks. MaD-Mix matches human-tuned data mixtures using 22% fewer training steps in image-text instruction tuning. In complex tri-modal video-image-text scenarios, where manual tuning becomes impractical, MaD-Mix boosts average accuracy over uniform mixture weights, with negligible mixture computation overhead (< 1 GPU-hour), enabling scalable mixture design for modern VLM pipelines. A minimal sketch of the mixture-weighting idea follows the byline below.

Wanyun Xie, Francesco Tonin, Volkan Cevher • 2026
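The abstract does not spell out the closed-form Fenchel-dual alignment scores, so the sketch below is only an illustrative approximation of the overall recipe: score each domain by how well its latent embeddings align with a reference distribution, average over whichever modalities the domain actually provides (so language-only domains are handled naturally), and turn the scores into mixture weights. The cosine-similarity proxy, the temperature softmax, and all domain names are assumptions for illustration, not the paper's method.

```python
import numpy as np

# Hypothetical per-domain latent embeddings (modality -> array of shape [n, d]).
# A `None` entry marks a missing modality (e.g., a language-only domain).
rng = np.random.default_rng(0)
domains = {
    "image_text_captions": {"image": rng.normal(size=(256, 64)), "text": rng.normal(size=(256, 64))},
    "video_text_qa":       {"video": rng.normal(size=(256, 64)), "text": rng.normal(size=(256, 64))},
    "language_only":       {"image": None,                       "text": rng.normal(size=(256, 64))},
}
# Reference embeddings standing in for the target distribution.
target = {"image": rng.normal(size=(512, 64)),
          "video": rng.normal(size=(512, 64)),
          "text":  rng.normal(size=(512, 64))}

def mean_embedding(x):
    """L2-normalized mean of a batch of latent vectors."""
    m = x.mean(axis=0)
    return m / np.linalg.norm(m)

def alignment_score(domain, target):
    """Cosine similarity to the target, averaged over the modalities the
    domain actually has -- a stand-in for the paper's closed-form scores."""
    sims = [float(mean_embedding(v) @ mean_embedding(target[k]))
            for k, v in domain.items() if v is not None]
    return sum(sims) / len(sims)

scores = np.array([alignment_score(d, target) for d in domains.values()])
# Temperature softmax (tau = 0.1 is an arbitrary choice) maps scores to weights.
weights = np.exp(scores / 0.1) / np.exp(scores / 0.1).sum()

for name, w in zip(domains, weights):
    print(f"{name}: {w:.3f}")
```

Under these assumptions, domains whose latent embeddings align better with the target receive larger sampling weights; the temperature controls how sharply the mixture concentrates on the best-aligned domains.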

Related benchmarks

Task                                                Dataset      Result           Rank
Multimodal Understanding                            MMBench      Accuracy: 42.44  367
Multi-discipline Multimodal Understanding           MMMU         Accuracy: 45.56  266
Science Question Answering                          ScienceQA    Accuracy: 64.5   229
Visual Question Answering                           ScienceQA    Accuracy: 87.26  210
Multimodal Understanding                            MMStar       Accuracy: 35.88  197
Visual Question Answering                           AI2D         Accuracy: 72.15  174
Optical Character Recognition Benchmarking          OCRBench     Accuracy: 57.2   109
Visual Question Answering                           RealworldQA  Accuracy: 57.39  98
Real-world Visual Question Answering                RealworldQA  Accuracy: 46.54  91
Massive Multi-discipline Multimodal Understanding   MMMU         Accuracy: 29.78  88
(Showing 10 of 19 rows)
