MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training
About
Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practice relies on costly manual tuning of the domain mixture. We propose MaD-Mix, a principled and computationally efficient framework for deriving multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual via inter-modal coupling variables. MaD-Mix systematically handles domains with missing modalities, allowing language-only domains to be integrated alongside multi-modal ones. Empirical evaluations on 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks: it matches human-tuned data mixtures using 22% fewer training steps in image-text instruction tuning, and in complex tri-modal video-image-text scenarios, where manual tuning becomes impractical, it boosts average accuracy over uniform weights. Mixture computation overhead is negligible (< 1 GPU-hour), enabling scalable mixture design for modern VLM pipelines.
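The abstract does not spell out how the closed-form alignment scores translate into sampling weights, so the following is only a minimal sketch of the general pattern: assume each training domain receives a scalar alignment score, and convert the scores into a normalized mixture via a temperature-controlled softmax. The function name, the softmax choice, and the temperature parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mixture_weights(alignment_scores, temperature=1.0):
    """Convert per-domain alignment scores into sampling weights.

    Hypothetical illustration: a temperature-controlled softmax, NOT
    the paper's closed-form dual solution. Higher-scoring domains get
    larger sampling probability; `temperature` controls how peaked the
    mixture is.
    """
    s = np.asarray(alignment_scores, dtype=float) / temperature
    s -= s.max()          # subtract the max for numerical stability
    w = np.exp(s)
    return w / w.sum()    # weights sum to 1

# Example: three image-text domains plus one language-only domain
# (scores here are made up for illustration).
scores = [0.8, 0.5, 0.3, 0.2]
weights = mixture_weights(scores)
print(weights)
```

A sharper temperature (e.g. `temperature=0.1`) concentrates sampling on the best-aligned domains, while a large temperature approaches the uniform baseline the paper compares against.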
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMBench | Accuracy | 42.44 | 637 |
| Science Question Answering | ScienceQA | Accuracy | 64.5 | 502 |
| Visual Question Answering | ScienceQA | Accuracy | 87.26 | 370 |
| Multimodal Understanding | MMStar | Accuracy | 35.88 | 324 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 45.56 | 317 |
| Visual Question Answering | AI2D | Accuracy | 72.15 | 249 |
| Mathematical Multimodal Reasoning | MathVerse | Accuracy | 18.91 | 221 |
| Visual Question Answering | RealworldQA | Accuracy | 57.39 | 179 |
| Massive Multi-discipline Multimodal Understanding | MMMU | Accuracy | 29.78 | 152 |
| Real-world Visual Question Answering | RealworldQA | Accuracy | 46.54 | 140 |