
Visual Instruction Tuning

About

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make the GPT-4 generated visual instruction tuning data, our model, and code base publicly available.
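The abstract's core architectural idea is to bridge a vision encoder and an LLM so that image features become tokens the language model can attend to. The sketch below illustrates one simple way such a bridge can work: a learned linear projection maps patch features into the LLM's embedding space, and the projected image tokens are prepended to the text token embeddings. The dimensions (1024 and 4096) and the plain linear projection are assumptions chosen for illustration, not a statement of the paper's exact configuration.

```python
import numpy as np

# Hypothetical dimensions, for illustration only.
VISION_DIM = 1024   # e.g. size of a ViT patch feature
LLM_DIM = 4096      # e.g. hidden size of the language model

rng = np.random.default_rng(0)

# Trainable projection mapping vision features into the LLM's token space.
W = rng.normal(scale=0.02, size=(VISION_DIM, LLM_DIM))

def project_image_features(patch_feats: np.ndarray) -> np.ndarray:
    """Map (num_patches, VISION_DIM) features to (num_patches, LLM_DIM) tokens."""
    return patch_feats @ W

def build_multimodal_sequence(image_tokens: np.ndarray,
                              text_embeds: np.ndarray) -> np.ndarray:
    """Prepend projected image tokens to the text embedding sequence."""
    return np.concatenate([image_tokens, text_embeds], axis=0)

# Stand-ins for the vision encoder's output and the text token embeddings.
patch_feats = rng.normal(size=(256, VISION_DIM))
text_embeds = rng.normal(size=(32, LLM_DIM))

seq = build_multimodal_sequence(project_image_features(patch_feats), text_embeds)
print(seq.shape)  # (288, 4096)
```

During instruction tuning, the LLM then learns to produce the response tokens conditioned on this combined image-plus-instruction sequence; only the projection (and optionally the LLM) needs to be trained, since the vision encoder can stay frozen.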

Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee• 2023

Related benchmarks

Task                                  Dataset                    Metric             Result   Rank
Visual Question Answering             VQA v2                     Accuracy           80       1165
Visual Question Answering             TextVQA                    Accuracy           61.2     1117
Visual Question Answering             VizWiz                     Accuracy           60.5     1043
Visual Question Answering             GQA                        Accuracy           63.3     963
Object Hallucination Evaluation       POPE                       Accuracy           86.5     935
Image Captioning                      MS COCO Karpathy (test)    CIDEr              0.3      682
Visual Question Answering             VQA v2 (test-dev)          Overall Accuracy   80       664
Multimodal Evaluation                 MME                        Score              1530     557
Text-based Visual Question Answering  TextVQA                    Accuracy           65.6     496
Video Question Answering              MSRVTT-QA                  Accuracy           54.7     481

Showing 10 of 617 rows
