Concept-skill Transferability-based Data Selection for Large Vision-Language Models

About

Instruction tuning, or supervised finetuning on extensive task-specific data, is necessary for Large Vision-Language Models (LVLMs) to generalize well across a broad range of vision-language (VL) tasks. However, training on large VL datasets can become prohibitively expensive. In this work, we introduce COINCIDE, an effective and scalable data selection technique that uses a small model as a reference model to select visual instruction tuning data for efficient finetuning of a target LVLM, focusing on diversity and transferability. Specifically, we cluster the training data using internal activations from a small model, which identifies the VL concept-skill compositions needed by a target LVLM. We then sample data from these diverse clusters by considering their density and transferability, that is, their ability to transfer well to other concept-skill compositions. This approach ensures the diversity of these compositions, which is vital for LVLM generalization. Extensive experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency compared with 8 strong baselines on two distinct datasets: LLaVA-1.5 and Vision-Flan. Using only 20% of the LLaVA-1.5 dataset, COINCIDE achieves performance comparable to the LVLM finetuned on the whole dataset, with a 70% reduction in wall-clock running time. On the Vision-Flan dataset, our method achieves superior results with only 16.7% of the training data.
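
The cluster-then-sample idea described above can be illustrated with a rough sketch. This is not the authors' implementation: the abstract gives no formulas, so everything here is an assumption made for illustration. The function name `select_subset`, the `temperature` parameter, the "spread" proxy for (inverse) density, the centroid cosine-similarity proxy for transferability, and uniform sampling within each cluster are all hypothetical. The input `features` stands in for activations taken from a small reference model.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def kmeans(features, k, iters=20, seed=0):
    """Plain k-means; returns (cluster index per sample, centroids)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(features, k)]
    assign = [0] * len(features)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in features]
        for c in range(k):
            members = [v for v, a in zip(features, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids


def select_subset(features, k, budget, temperature=0.1, seed=0):
    """Pick `budget` sample indices, weighting clusters by two proxy scores."""
    rng = random.Random(seed)
    assign, centroids = kmeans(features, k, seed=seed)
    clusters = [[i for i, a in enumerate(assign) if a == c] for c in range(k)]

    weights = []
    for c, members in enumerate(clusters):
        if not members:
            weights.append(0.0)
            continue
        # ASSUMPTION: spread (mean distance to centroid) as an inverse-density
        # proxy, so spread-out clusters receive a larger share of the budget.
        spread = sum(math.sqrt(sq_dist(features[i], centroids[c]))
                     for i in members) / len(members)
        # ASSUMPTION: transferability proxied by mean cosine similarity
        # between this centroid and the other non-empty clusters' centroids.
        sims = [cosine(centroids[c], centroids[o])
                for o in range(k) if o != c and clusters[o]]
        transfer = sum(sims) / len(sims) if sims else 0.0
        weights.append(math.exp(transfer / temperature) * (spread + 1e-8))

    total = sum(weights)
    quotas = [min(len(m), int(budget * w / total))
              for m, w in zip(clusters, weights)]

    # Hand leftover budget to clusters with spare capacity, heaviest first.
    order = sorted(range(k), key=lambda c: -weights[c])
    leftover = budget - sum(quotas)
    while leftover > 0:
        progressed = False
        for c in order:
            if leftover > 0 and quotas[c] < len(clusters[c]):
                quotas[c] += 1
                leftover -= 1
                progressed = True
        if not progressed:
            break

    selected = []
    for members, quota in zip(clusters, quotas):
        # ASSUMPTION: uniform sampling inside each cluster.
        selected.extend(rng.sample(members, quota))
    return sorted(selected)
```

Calling `select_subset(features, k=3, budget=12)` on a list of feature vectors returns 12 distinct sample indices spread across the clusters. The point of the sketch is only the overall shape: group samples by reference-model features, score each group, then split a fixed selection budget according to those scores.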

Jaewoo Lee, Boyang Li, Sung Ju Hwang • 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 84.3	2019
Visual Question Answering	VizWiz	Accuracy 46.8	1820
Visual Question Answering	TextVQA	Accuracy 58.6	1453
Visual Question Answering	VQA v2	Accuracy 66	1429
Text-based Visual Question Answering	TextVQA	Accuracy 55.6	962
Multimodal Understanding	MMBench	Accuracy 63.1	847
Multimodal Evaluation	MME	--	727
Multimodal Reasoning	MM-Vet	MM-Vet Score 28.5	517
Diagram Question Answering	AI2D	--	387
Object Hallucination	POPE Popular	--	372

Showing 10 of 69 rows
