Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness
About
Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to handle real-world tasks. However, the rapid expansion of visual instruction datasets introduces data redundancy, leading to excessive computational costs. We propose a collaborative framework, DataTailor, which leverages three key principles (informativeness, uniqueness, and representativeness) for effective data selection. We argue that a valuable sample should be informative of the task, non-redundant, and representative of the sample distribution (i.e., not an outlier). We further propose practical ways to score samples against each principle, which automatically adapt to a given dataset without tedious hyperparameter tuning. Comprehensive experiments on various benchmarks demonstrate that DataTailor achieves 101.3% of the performance of full-data fine-tuning with only 15% of the data, significantly reducing computational costs while maintaining superior results. This exemplifies the "Less is More" philosophy in MLLM development. The code and data are available at https://github.com/Yuqifan1117/DataTailor.
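To make the three principles concrete, the following is a minimal sketch of score-based selection under a data budget. It is not DataTailor's actual scoring: the function `select_samples`, the cosine-similarity proxies for uniqueness and representativeness, the externally supplied `informativeness` scores, and the equal-weight sum are all assumptions chosen for illustration. Only the 15% budget is taken from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def min_max(xs):
    """Rescale scores to [0, 1]; constant inputs map to 0.5."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.5] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def select_samples(embeddings, informativeness, budget=0.15):
    """Return sorted indices of the samples kept under the given budget.

    Hypothetical proxies (assumptions, not the paper's definitions):
      - informativeness: supplied by the caller, one score per sample
      - uniqueness: 1 - similarity to the nearest other sample
      - representativeness: mean similarity to all other samples
    """
    n = len(embeddings)
    if n < 2 or len(informativeness) != n:
        raise ValueError("need at least two samples and one score per sample")

    sims = [[cosine(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]
    uniqueness = [1.0 - max(sims[i][j] for j in range(n) if j != i)
                  for i in range(n)]
    representativeness = [sum(sims[i][j] for j in range(n) if j != i) / (n - 1)
                          for i in range(n)]

    # Equal-weight sum of the normalized scores (an illustrative choice).
    combined = [a + b + c for a, b, c in zip(min_max(informativeness),
                                             min_max(uniqueness),
                                             min_max(representativeness))]

    k = max(1, round(budget * n))
    ranked = sorted(range(n), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:k])
```

In this sketch, exact duplicates receive a uniqueness of zero and so tend to fall out of the selection, while a sample that is dissimilar to everything else is penalized by the representativeness term. A real pipeline would replace the hand-supplied `informativeness` values and toy embeddings with scores derived from the model and dataset.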
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Hallucination | POPE Popular | -- | -- | 372 |
| Visual Question Answering | GQA (test-dev) | Accuracy | 49.5 | 236 |
| Object Hallucination Evaluation | POPE Adversarial | Accuracy | 85.3 | 159 |
| Object Hallucination Evaluation | POPE (Random) | Accuracy | 85.3 | 152 |
| Object Hallucination Evaluation | POPE (test) | Accuracy | 85.3 | 107 |
| Visual Question Answering | Vizwiz (val) | VQA Score | 31.8 | 66 |
| Multimodal Question Answering | ScienceQA | Accuracy | 71 | 61 |
| Multimodal Understanding | MME Perception | -- | -- | 59 |
| Multimodal Understanding | MME Cognition | Score | 319.2 | 45 |
| Visual Question Answering | VizWiz | Acc | 49.5 | 31 |