TRINS: Towards Multimodal Language Models that Can Read

About

Large multimodal language models have shown remarkable proficiency in understanding and editing images. However, most of these visually-tuned models struggle to comprehend the textual content embedded in images, primarily because of limited training data. In this work, we introduce TRINS: a Text-Rich image INStruction dataset, with the objective of enhancing the reading ability of multimodal large language models. TRINS is built upon LAION using hybrid data annotation strategies that include machine-assisted and human-assisted annotation processes. It contains 39,153 text-rich images, captions, and 102,437 questions. In particular, we show that annotations in TRINS contain significantly more words on average than those of related datasets, which poses new challenges. Furthermore, we introduce a simple and effective architecture, called the Language-vision Reading Assistant (LaRA), which is good at understanding textual content within images. LaRA outperforms existing state-of-the-art multimodal large language models on the TRINS dataset, as well as on other classical benchmarks. Lastly, we conduct a comprehensive evaluation with TRINS on various text-rich image understanding and generation tasks, demonstrating its effectiveness.
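To make the dataset statistics concrete, here is a minimal Python sketch. It uses only the two counts stated in the abstract (39,153 images and 102,437 questions) to derive the average number of questions per image, and shows one plausible way to measure "words per annotation" by whitespace tokenization. The helper name `mean_words_per_annotation` and the sample captions are hypothetical and are not taken from TRINS; the paper's own counting procedure may differ.

```python
def mean_words_per_annotation(annotations):
    """Average whitespace-delimited word count over a list of annotation strings."""
    if not annotations:
        raise ValueError("annotations must be non-empty")
    return sum(len(a.split()) for a in annotations) / len(annotations)


# Counts stated in the abstract.
NUM_IMAGES = 39_153
NUM_QUESTIONS = 102_437

# Average number of questions per image implied by those counts.
questions_per_image = NUM_QUESTIONS / NUM_IMAGES

# Hypothetical captions, for illustration only (not real TRINS data).
sample_captions = [
    "A red poster that reads SALE in bold white letters",
    "A street sign",
]

print(round(questions_per_image, 2))
print(mean_words_per_annotation(sample_captions))
```

Whitespace splitting is the simplest possible word count; a tokenizer-based count would give different absolute numbers but would support the same kind of cross-dataset comparison.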

Ruiyi Zhang, Yanzhe Zhang, Jian Chen, Yufan Zhou, Jiuxiang Gu, Changyou Chen, Tong Sun • 2024

Related benchmarks

Task                                      Dataset   Result                 Rank
Visual Question Answering                 VizWiz    Accuracy 53.1          1043
Visual Question Answering                 GQA       Accuracy 42.4          374
Visual Question Answering                 OKVQA     Top-1 Accuracy 58.1    283
Visual Question Answering                 ChartQA   Accuracy 25.6          239
Visual Question Answering                 DocVQA    Accuracy 50.8          103
Visual Question Answering                 InfoVQA   Accuracy 28.4          69
Visual Question Answering                 OCRVQA    Accuracy 41.2          47
Multimodal Optical Character Recognition  OCRBench  Recognition Score 211  34
Visual Question Answering                 VSR       --                     26
Visual Question Answering                 ST-VQA    Accuracy 47.2          15
Showing 10 of 19 rows
