Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories

About

Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of the existing methods can outperform random selection across different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tuning have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singular values of their attention matrices obtained from fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments show that XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving the performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2x. This is 30% more data reduction than the best baseline achieves on LLaVA-665k. The project's website can be found at https://bigml-cs-ucla.github.io/XMAS-project-page/.
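
The selection pipeline described above (singular-value trajectories, clustering, balanced sampling) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of plain k-means, the number of singular values kept (k), and the round-robin sampling rule are all assumptions made for illustration, and random arrays stand in for the cross-modal attention matrices that a proxy LVLM would produce at each checkpoint.

```python
import numpy as np

def trajectory_features(attn_checkpoints, k=3):
    """Build one feature vector per example from its attention matrices.

    attn_checkpoints: list with one array per checkpoint, each shaped
    (n_examples, rows, cols). Returns an array of shape
    (n_examples, n_checkpoints * k) holding the top-k singular values
    at every checkpoint (the "trajectory").
    """
    feats = []
    for attn in attn_checkpoints:
        # np.linalg.svd batches over leading dimensions and returns
        # singular values in descending order.
        s = np.linalg.svd(attn, compute_uv=False)
        feats.append(s[:, :k])
    return np.concatenate(feats, axis=1)

def kmeans(x, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the clustering step)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = x[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels

def balanced_sample(labels, budget, seed=0):
    """Draw examples round-robin across clusters until the budget is met."""
    rng = np.random.default_rng(seed)
    pools = [list(rng.permutation(np.flatnonzero(labels == c)))
             for c in np.unique(labels)]
    chosen = []
    while len(chosen) < budget and any(pools):
        for pool in pools:
            if pool and len(chosen) < budget:
                chosen.append(int(pool.pop()))
    return sorted(chosen)

# Synthetic stand-in data: 20 examples, 3 checkpoints, 4x6 attention matrices.
rng = np.random.default_rng(0)
checkpoints = [rng.random((20, 4, 6)) for _ in range(3)]
features = trajectory_features(checkpoints, k=2)
labels = kmeans(features, n_clusters=4)
subset = balanced_sample(labels, budget=10)
```

Round-robin sampling is used here only because it keeps small clusters represented in the subset; the abstract says the subset is balanced across clusters but does not specify the exact rule.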

Nilay Naharas, Dang Nguyen, Nesihan Bulut, Mohammadhossein Bateni, Vahab Mirrokni, Baharan Mirzasoleiman • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 76.2	2056
Visual Question Answering	TextVQA	Accuracy 55.2	1455
Visual Question Answering	VQA v2	Accuracy 75.1	1429
Multimodal Reasoning	MM-Vet	MM-Vet Score 24.3	551
Multimodal Capability Evaluation	MM-Vet	Score 31	429
Multimodal Evaluation	MM-Vet	--	249
Visual Question Answering	GQA	Mean Accuracy 61.2	196
Visual Question Answering	GQA	Score 49.4	193
Multimodal Evaluation	MMBench CN	Accuracy 44	163
Multimodal Evaluation	MMBench	MMB^CN Score 54.1	146

Showing 10 of 30 rows
