
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

About

Despite vision-language models' (VLMs) remarkable capabilities as versatile visual assistants, two substantial challenges persist within the existing VLM frameworks: (1) a lack of task diversity in pretraining and visual instruction tuning, and (2) annotation error and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, with each task accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework, in which VLMs are first finetuned on Vision-Flan and then further tuned on GPT-4 synthesized data. We find that this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework and achieves state-of-the-art performance across a wide range of multi-modal evaluation benchmarks. Finally, we conduct in-depth analyses to understand visual instruction tuning, and our findings reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs' capabilities but rather modulates the model's responses to human-preferred formats; (2) a minimal quantity (e.g., 1,000) of GPT-4 synthesized data can effectively align VLM responses with human preferences; (3) visual instruction tuning mainly helps large language models (LLMs) to understand visual features.

Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu Cheng, Qifan Wang, Lifu Huang • 2024

Related benchmarks

Task                       Dataset    Metric                       Result  Rank
Image Reasoning            MMMU       Accuracy                     47.4    17
Image Reasoning            MathVista  Accuracy                     49.7    17
General MCQA               MMBench    Accuracy                     76.2    10
User Preference & Fluency  MMVet      MMVet User Preference Score  41.5    10
User Preference & Fluency  LLaVA-W    Score                        58.8    10
General MCQA               SEEDBench  Score                        71.3    9
Text-Rich VQA              AI2D       Score                        71.6    9
Text-Rich VQA              TextVQA    Score                        80.2    8
