MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training
About
Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practice relies on costly manual tuning of mixture weights. We propose MaD-Mix, a principled and computationally efficient framework for deriving multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual via inter-modal coupling variables. MaD-Mix also handles domains with missing modalities systematically, allowing language-only domains to be integrated. Empirical evaluations on 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks: it matches human-tuned data mixtures while using 22% fewer training steps in image-text instruction tuning, and in complex tri-modal video-image-text scenarios, where manual tuning becomes impractical, it boosts average accuracy over uniform weights. Mixture computation adds negligible overhead (< 1 GPU-hour), enabling scalable mixture design for modern VLM pipelines.
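The abstract does not spell out the Fenchel-dual derivation or the inter-modal coupling variables, so the snippet below is only a minimal Python sketch of the overall recipe it describes: score each candidate domain's alignment with a reference distribution per modality, skip modalities a domain lacks (e.g. language-only corpora), and convert the scores into sampling weights. The per-modality cosine alignment and the softmax weighting are stand-in assumptions, not the paper's closed-form scores, and all names (`alignment_score`, `mixture_weights`, `temperature`) are hypothetical.

```python
import numpy as np

def alignment_score(domain_feats, target_feats):
    """Stand-in alignment score (assumption, not the paper's closed form):
    cosine similarity between the mean embedding of a domain and of the
    target distribution, averaged over the modalities both provide.
    `domain_feats` / `target_feats`: dict modality -> (n, d) array.
    A domain with a missing modality (e.g. language-only) simply omits
    that key, and the score averages over the shared modalities only."""
    shared = set(domain_feats) & set(target_feats)
    if not shared:
        return 0.0
    scores = []
    for m in shared:
        mu_d = domain_feats[m].mean(axis=0)
        mu_t = target_feats[m].mean(axis=0)
        scores.append(mu_d @ mu_t / (np.linalg.norm(mu_d) * np.linalg.norm(mu_t)))
    return float(np.mean(scores))

def mixture_weights(all_domain_feats, target_feats, temperature=1.0):
    """Turn per-domain alignment scores into sampling weights via a
    temperature-controlled softmax, so better-aligned domains are
    sampled more often. The softmax is an illustrative choice."""
    s = np.array([alignment_score(f, target_feats) for f in all_domain_feats])
    z = np.exp((s - s.max()) / temperature)  # shift max for numerical stability
    return z / z.sum()

if __name__ == "__main__":
    # Toy demo with random features: one image-text domain and one
    # language-only domain, weighted against an image-text target.
    rng = np.random.default_rng(0)
    target = {"image": rng.normal(size=(64, 16)), "text": rng.normal(size=(64, 16))}
    domains = [
        {"image": rng.normal(size=(32, 16)), "text": rng.normal(size=(32, 16))},
        {"text": rng.normal(size=(32, 16))},  # missing image modality is skipped
    ]
    print(mixture_weights(domains, target))  # weights sum to 1
```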
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMBench | Accuracy | 42.44 | 367 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 45.56 | 266 |
| Science Question Answering | ScienceQA | Accuracy | 64.5 | 229 |
| Visual Question Answering | ScienceQA | Accuracy | 87.26 | 210 |
| Multimodal Understanding | MMStar | Accuracy | 35.88 | 197 |
| Visual Question Answering | AI2D | Accuracy | 72.15 | 174 |
| Optical Character Recognition Benchmarking | OCRBench | Accuracy | 57.2 | 109 |
| Visual Question Answering | RealworldQA | Accuracy | 57.39 | 98 |
| Real-world Visual Question Answering | RealworldQA | Accuracy | 46.54 | 91 |
| Massive Multi-discipline Multimodal Understanding | MMMU | Accuracy | 29.78 | 88 |