
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

About

Instruction tuning unlocks the superior capability of Large Language Models (LLMs) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. We then prompt text-only GPT-4 with the recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction skills (e.g., reasoning, writing, and elaboration) with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at https://llavar.github.io/.
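The data-generation step described above — prompting a text-only LLM with OCR results and an image caption so it can write QA pairs about an image it never sees — can be sketched roughly as follows. This is an illustrative assumption of how such a prompt might be assembled; the function name, prompt wording, and example inputs are hypothetical, not the authors' actual implementation.

```python
def build_generation_prompt(ocr_words: list[str], caption: str) -> str:
    """Assemble a text-only prompt that stands in for a text-rich image.

    The LLM never sees pixels: it receives the OCR-recognized words and
    the image caption, and is asked to produce question-answer pairs that
    a visual assistant could answer from the image alone.
    """
    ocr_text = " ".join(ocr_words)
    return (
        "You are given the OCR text and the caption of a text-rich image.\n"
        f"OCR: {ocr_text}\n"
        f"Caption: {caption}\n"
        "Generate question-answer pairs about the textual content of the "
        "image. Questions must be answerable from the image alone; do not "
        "refer to the OCR results or the caption explicitly."
    )


if __name__ == "__main__":
    # Hypothetical example: a book cover, one of the text-rich image
    # types mentioned in the abstract.
    prompt = build_generation_prompt(
        ["THE", "GREAT", "GATSBY", "F.", "Scott", "Fitzgerald"],
        "A vintage book cover with art-deco lettering.",
    )
    print(prompt)
```

The resulting prompt would then be sent to a text-only model (here, GPT-4), and the returned conversations combined with existing multi-modal instruction data for tuning.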

Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-based Visual Question Answering | TextVQA | Accuracy | 41.8 | 496 |
| Visual Question Answering | ChartQA | Accuracy | 12.2 | 239 |
| Chart Question Answering | ChartQA | Accuracy | 12.2 | 229 |
| Information Extraction | CORD (test) | F1 Score | 13.55 | 133 |
| Visual Question Answering | DocVQA | Accuracy | 48.3 | 103 |
| Document-oriented Visual Question Answering | DocVQA | Accuracy | 12.3 | 72 |
| Visual Question Answering | InfoVQA | Accuracy | 16.5 | 69 |
| Information Extraction | SROIE (test) | F1 Score | 2.38 | 58 |
| Information Extraction | FUNSD (test) | F1 Score | 1.71 | 55 |
| Document Visual Question Answering | DocVQA v1.0 (test) | ANLS | 11.6 | 49 |

Showing 10 of 27 rows.
