
Concept-skill Transferability-based Data Selection for Large Vision-Language Models

About

Instruction tuning, or supervised finetuning on extensive task-specific data, is necessary for Large Vision-Language Models (LVLMs) to generalize well across a broad range of vision-language (VL) tasks. However, training on large VL datasets can become prohibitively expensive. In this work, we introduce COINCIDE, an effective and scalable data selection technique that uses a small model as a reference model to select visual instruction tuning data for efficient finetuning of a target LVLM, focusing on diversity and transferability. Specifically, we cluster the training data using internal activations from a small model, which identifies the VL concept-skill compositions needed by a target LVLM. We then sample data from these diverse clusters by considering their density and transferability, that is, their ability to transfer well to other concept-skill compositions. This approach ensures the diversity of these compositions, which is vital for LVLM generalization. Extensive experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency compared with 8 strong baselines on two distinct datasets: LLaVA-1.5 and Vision-Flan. Using only 20% of the LLaVA-1.5 dataset, COINCIDE achieves performance comparable to the LVLM finetuned on the whole dataset, with a 70% reduction in wall-clock running time. On the Vision-Flan dataset, our method achieves superior results with only 16.7% of the training data.
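The sampling step described above can be illustrated with a minimal sketch. It assumes the clusters have already been computed from the small model's activations, and that each cluster comes with a transferability score and a positive density score. The weighting rule (a softmax over transferability divided by density), the `temperature` parameter, and the function names `allocate_budget` and `select_from_clusters` are all assumptions made for illustration; they are not taken from the abstract and may differ from the authors' actual formulation.

```python
import math
import random


def allocate_budget(transferability, density, budget, temperature=1.0):
    """Split `budget` samples across clusters.

    Assumed rule (illustrative only): weight_i is proportional to
    exp(transferability_i / (temperature * density_i)), so clusters that
    transfer well get more samples and dense (redundant) clusters get fewer.
    Densities must be strictly positive.
    """
    scores = [t / (temperature * d) for t, d in zip(transferability, density)]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    raw = [budget * w / total for w in weights]

    # Floor each share, then hand the leftover samples to the clusters
    # with the largest fractional parts so the counts sum to `budget`.
    counts = [int(r) for r in raw]
    leftover = budget - sum(counts)
    by_fraction = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


def select_from_clusters(cluster_members, counts, seed=0):
    """Draw `counts[i]` examples uniformly at random from cluster i.

    A cluster smaller than its share contributes all of its members,
    so the result can be shorter than the requested budget.
    """
    rng = random.Random(seed)
    selected = []
    for members, k in zip(cluster_members, counts):
        selected.extend(rng.sample(members, min(k, len(members))))
    return selected


# Toy example: two clusters of example indices, equal density,
# the first cluster assumed to transfer better than the second.
clusters = [list(range(0, 30)), list(range(30, 50))]
budget = int(0.2 * 50)  # keep 20% of the 50 examples
counts = allocate_budget([0.9, 0.5], [1.0, 1.0], budget)
subset = select_from_clusters(clusters, counts)
```

Spreading the budget over all clusters, rather than taking the top-scoring examples globally, is what keeps every concept-skill composition represented in the selected subset.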

Jaewoo Lee, Boyang Li, Sung Ju Hwang • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy 84.3 | 2019 |
| Visual Question Answering | VizWiz | Accuracy 46.8 | 1820 |
| Visual Question Answering | TextVQA | Accuracy 58.6 | 1453 |
| Visual Question Answering | VQA v2 | Accuracy 66 | 1429 |
| Text-based Visual Question Answering | TextVQA | Accuracy 55.6 | 962 |
| Multimodal Understanding | MMBench | Accuracy 63.1 | 847 |
| Multimodal Evaluation | MME | -- | 727 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score 28.5 | 517 |
| Diagram Question Answering | AI2D | -- | 387 |
| Object Hallucination | POPE Popular | -- | 372 |

Showing 10 of 69 rows
