LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
About
Instruction tuning unlocks the superior capability of Large Language Models (LLMs) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers, etc.). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction (e.g., reasoning, writing, and elaboration) skills with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at https://llavar.github.io/.
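The data-collection step described above (feeding OCR-recognized text and an image caption to text-only GPT-4 to generate question-answer pairs) can be sketched as follows. This is a minimal illustrative sketch, not the LLaVAR codebase: the function name, prompt wording, and inputs are all hypothetical, and the actual OCR and GPT-4 calls are left out.

```python
def build_gpt4_prompt(ocr_text: str, caption: str) -> str:
    """Combine OCR results and an image caption into a text-only prompt
    that asks GPT-4 to generate question-answer pairs about the image.

    Hypothetical helper for illustration; the real LLaVAR prompt
    template may differ.
    """
    return (
        "You are given the caption and OCR-recognized text of an image.\n"
        f"Caption: {caption}\n"
        f"OCR text: {ocr_text}\n"
        "Generate question-answer pairs that require reading the text "
        "shown in the image."
    )


if __name__ == "__main__":
    # Example inputs standing in for one LAION text-rich image.
    prompt = build_gpt4_prompt(
        ocr_text="GRAND OPENING - SALE 50% OFF",
        caption="A storefront poster announcing a sale.",
    )
    print(prompt)
```

The resulting string would then be sent to text-only GPT-4; its response (the generated conversation) becomes one instruction-following training example.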
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-based Visual Question Answering | TextVQA | Accuracy: 41.8 | 496 |
| Visual Question Answering | ChartQA | Accuracy: 12.2 | 239 |
| Chart Question Answering | ChartQA | Accuracy: 12.2 | 229 |
| Information Extraction | CORD (test) | F1 Score: 13.55 | 133 |
| Visual Question Answering | DocVQA | Accuracy: 48.3 | 103 |
| Document-oriented Visual Question Answering | DocVQA | Accuracy: 12.3 | 72 |
| Visual Question Answering | InfoVQA | Accuracy: 16.5 | 69 |
| Information Extraction | SROIE (test) | F1 Score: 2.38 | 58 |
| Information Extraction | FUNSD (test) | F1 Score: 1.71 | 55 |
| Document Visual Question Answering | DocVQA v1.0 (test) | ANLS: 11.6 | 49 |