Visual Instruction Tuning

About

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.Our early experiments show that LLaVA demonstrates impressive multimodel chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.

Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee• 2023

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy87.2	2019
Visual Question Answering	VizWiz	Accuracy60.5	1820
Visual Question Answering	TextVQA	Accuracy76.55	1453
Visual Question Answering	VQA v2	Accuracy80	1429
Visual Question Answering	GQA	Accuracy63.3	1425
Text-based Visual Question Answering	TextVQA	Accuracy81.8	962
Multimodal Understanding	MMBench	Accuracy72.5	847
Science Question Answering	ScienceQA	Accuracy71.6	791
Multimodal Evaluation	MME	Score1.86e+3	727
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy80	712

Showing 10 of 1000 rows

...

Other info

Code

Follow for update

@wizwand_team Discord