HRVDA: High-Resolution Visual Document Assistant

About

Leveraging vast training data, multimodal large language models (MLLMs) have demonstrated formidable general visual comprehension capabilities and achieved remarkable performance across various tasks. However, their performance in visual document understanding still leaves much room for improvement. This discrepancy is primarily attributed to the fact that visual document understanding is a fine-grained prediction task. In natural scenes, MLLMs typically use low-resolution images, leading to a substantial loss of visual information. Furthermore, general-purpose MLLMs do not excel in handling document-oriented instructions. In this paper, we propose a High-Resolution Visual Document Assistant (HRVDA), which bridges the gap between MLLMs and visual document understanding. This model employs a content filtering mechanism and an instruction filtering module to separately filter out the content-agnostic visual tokens and instruction-agnostic visual tokens, thereby achieving efficient model training and inference for high-resolution images. In addition, we construct a document-oriented visual instruction tuning dataset and apply a multi-stage training strategy to enhance the model's document modeling capabilities. Extensive experiments demonstrate that our model achieves state-of-the-art performance across multiple document understanding datasets, while maintaining training efficiency and inference speed comparable to low-resolution models.

Chaohu Liu, Kun Yin, Haoyu Cao, Xinghua Jiang, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Linli Xu • 2024
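
The two-stage token filtering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `filter_visual_tokens`, the per-token scores, the `content_threshold` cutoff, and the `keep_ratio` parameter are all hypothetical, chosen only to show how dropping content-agnostic tokens first and instruction-agnostic tokens second shrinks the sequence passed to the language model.

```python
import math
from typing import List, Sequence


def filter_visual_tokens(
    content_scores: Sequence[float],
    instruction_scores: Sequence[float],
    content_threshold: float = 0.5,  # hypothetical cutoff, not from the paper
    keep_ratio: float = 0.5,         # hypothetical fraction kept in stage 2
) -> List[int]:
    """Return indices of visual tokens that survive both filtering stages."""
    if len(content_scores) != len(instruction_scores):
        raise ValueError("score lists must have the same length")

    # Stage 1 (content filtering): drop tokens that carry no document
    # content, e.g. blank background patches.
    survivors = [
        i for i, score in enumerate(content_scores) if score >= content_threshold
    ]
    if not survivors:
        return []

    # Stage 2 (instruction filtering): keep only the survivors most
    # relevant to the instruction.
    k = max(1, math.ceil(len(survivors) * keep_ratio))
    ranked = sorted(survivors, key=lambda i: instruction_scores[i], reverse=True)

    # Restore the original spatial order of the kept tokens.
    return sorted(ranked[:k])


# Six patches: three background patches, a title, a total, and a date.
content = [0.1, 0.9, 0.2, 0.8, 0.7, 0.05]
relevance = [0.0, 0.2, 0.0, 0.9, 0.4, 0.0]
kept = filter_visual_tokens(content, relevance)
print(kept)
```

In this toy input, stage 1 removes the three background patches and stage 2 keeps the two remaining patches most relevant to the instruction, so only a third of the original tokens would reach the language model.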

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy 73.3 | 1285 |
| Visual Question Answering | ChartQA | Accuracy 67.6 | 371 |
| Chart Question Answering | ChartQA | Accuracy 67.6 | 356 |
| Document Visual Question Answering | DocVQA | ANLS 72.1 | 263 |
| Text-based Visual Question Answering | TextVQA (val) | Accuracy 73.3 | 262 |
| Document Visual Question Answering | DocVQA (test) | ANLS 72.1 | 213 |
| Chart Question Answering | ChartQA (test) | -- | 176 |
| Information Visual Question Answering | InfoVQA (test) | ANLS 43.5 | 130 |
| Table Fact Verification | TabFact | Accuracy 0.723 | 104 |
| Table Question Answering | WTQ (test) | Denotation Accuracy 31.2 | 62 |
Showing 10 of 21 rows
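
The DocVQA and InfoVQA rows report ANLS (Average Normalized Levenshtein Similarity). The sketch below shows how that metric is commonly computed, assuming the usual 0.5 threshold and case-insensitive comparison; the exact normalization behind the numbers above is not stated on this page, and the function names are illustrative. The table values appear to be this score scaled to 0-100.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]


def anls(
    predictions: Sequence[str],
    references: Sequence[List[str]],
    threshold: float = 0.5,  # assumed standard cutoff
) -> float:
    """Mean over questions of the best thresholded similarity to any reference."""
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


# One near-miss answer and one wrong answer.
score = anls(["helo", "xyz"], [["hello"], ["hello", "world"]])
print(round(score, 3))
```

The threshold gives partial credit for small recognition errors, such as a single missing character, while answers that are mostly wrong contribute nothing.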
