MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
About
Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on WildBench.
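The iterative selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the square-root information function, the toy label graph, and the quality scores are all assumptions chosen for illustration. Each sample carries a quality score and a set of label indices; label mass is propagated over a similarity-weighted label graph, a concave function turns the propagated mass into information, and the sampler greedily adds the sample with the largest information gain.

```python
import math


def information(mass, adjacency, phi=math.sqrt):
    """Total information of a selection (illustrative definition, an assumption).

    mass[j] is the accumulated quality on label j. Mass is propagated to
    neighbouring labels through `adjacency`, then passed through a concave
    function `phi` so that repeated coverage of a region has diminishing returns.
    """
    n = len(mass)
    total = 0.0
    for i in range(n):
        propagated = sum(adjacency[i][j] * mass[j] for j in range(n))
        total += phi(propagated)
    return total


def greedy_select(samples, adjacency, budget):
    """Greedily pick up to `budget` samples, each time maximizing information gain.

    samples: list of (quality, label_indices) pairs (hypothetical input format).
    Returns the indices of the selected samples in selection order.
    """
    mass = [0.0] * len(adjacency)
    current = information(mass, adjacency)
    selected = []
    remaining = set(range(len(samples)))
    while remaining and len(selected) < budget:
        best_idx, best_gain = None, float("-inf")
        for idx in sorted(remaining):
            quality, labels = samples[idx]
            trial = mass[:]
            for label in labels:
                trial[label] += quality
            gain = information(trial, adjacency) - current
            if gain > best_gain:
                best_idx, best_gain = idx, gain
        # Commit the best sample and update the accumulated label mass.
        quality, labels = samples[best_idx]
        for label in labels:
            mass[label] += quality
        current += best_gain
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Toy label graph: label 0 ("math") and label 1 ("algebra") are closely
# related; label 2 ("coding") is unrelated to both.
adjacency = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# (quality, labels) pairs with made-up scores.
samples = [(0.9, [0]), (0.8, [1]), (0.6, [2])]

picked = greedy_select(samples, adjacency, budget=2)
print(picked)  # [0, 2]
```

In this toy case, ranking by quality alone would pick samples 0 and 1. The gain-based sampler instead picks the lower-quality coding sample second, because the algebra sample adds little information once the neighbouring math label is already covered.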
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reasoning | BBH, GSM8K | BBH Score | 32.27 | 30 |
| General Capability | BBH, GSM8K, MMLU, TruthfulQA, HumanEval, MBPP | Average Score | 25.06 | 30 |
| Knowledge | MMLU, TruthfulQA | MMLU | 32.08 | 30 |
| Coding | HumanEval, MBPP | HumanEval Score | 18.26 | 30 |
| Instruction Following | Tulu3 Evaluation Suite pool (test) | ARC | 91.53 | 25 |
| Instruction Tuning | WizardLM | Reasoning Score | 74.1 | 20 |
| Instruction Tuning | Alpaca GPT4 | Reasoning | 74.78 | 20 |
| Instruction Tuning | CoT | Reasoning Score | 67.9 | 20 |