Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding

About

We present a novel OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs). Our approach employs multi-scale visual features to effectively handle various font sizes within document images. To address the increasing costs of considering the multi-scale visual inputs for MLLMs, we propose the Hierarchical Visual Feature Aggregation (HVFA) module, designed to reduce the number of input tokens to LLMs. Leveraging a feature pyramid with cross-attentive pooling, our approach effectively manages the trade-off between information loss and efficiency without being affected by varying document image sizes. Furthermore, we introduce a novel instruction tuning task, which facilitates the model's text-reading capability by learning to predict the relative positions of input text, eventually minimizing the risk of truncated text caused by the limited capacity of LLMs. Comprehensive experiments validate the effectiveness of our approach, demonstrating superior performance in various document understanding tasks.
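To make the token-reduction idea concrete, here is a minimal sketch of cross-attentive pooling over a grid of visual features. It is not the paper's HVFA module. The function name `cross_attentive_pool`, the 2x2 window, the use of the window mean as the attention query, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np


def cross_attentive_pool(feats: np.ndarray, window: int = 2) -> np.ndarray:
    """Pool an (H, W, D) feature grid down to (H // window, W // window, D).

    Each output token is an attention-weighted sum of the tokens in its
    window. The query is the window mean (an assumption of this sketch;
    a real module would typically use learned query/key/value projections).
    """
    h, w, d = feats.shape
    if h % window or w % window:
        raise ValueError("grid size must be divisible by the window size")

    # Group tokens by window: (H/window, W/window, window*window, D).
    blocks = feats.reshape(h // window, window, w // window, window, d)
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    blocks = blocks.reshape(h // window, w // window, window * window, d)

    # One query per window, here simply the average-pooled token.
    query = blocks.mean(axis=2, keepdims=True)

    # Scaled dot-product scores between the query and each window token.
    scores = (query * blocks).sum(axis=-1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Attention-weighted sum of the window tokens.
    return (weights[..., None] * blocks).sum(axis=2)


# Hypothetical fine-scale feature grid: 8 x 8 tokens of width 16.
rng = np.random.default_rng(0)
fine = rng.normal(size=(8, 8, 16))
pooled = cross_attentive_pool(fine)
print(fine.shape[:2], "->", pooled.shape[:2])
```

With a 2x2 window the token count drops by a factor of four (64 tokens to 16 in this example), which is the kind of reduction that keeps multi-scale inputs affordable for the LLM. Unlike plain average pooling, the attention weights let tokens that agree more strongly with the window summary contribute more to the pooled token.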

Jaeyoo Park, Jin Young Choi, Jeonghyung Park, Bohyung Han • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Document Visual Question Answering | DocVQA (test) | ANLS | 72.7 | 192 |
| Chart Question Answering | ChartQA (test) | Accuracy | 63.3 | 129 |
| Visual Question Answering | TextVQA (test) | Accuracy | 59.2 | 124 |
| Table Fact Verification | TabFact (test) | Accuracy | 68.2 | 98 |
| Information Visual Question Answering | InfoVQA (test) | ANLS | 45.9 | 92 |
| Table Question Answering | WikiTableQuestions (test) | Accuracy | 34.5 | 86 |
| Image Captioning | TextCaps (test) | CIDEr | 135.2 | 50 |
| Document Visual Question Answering | DocVQA v1.0 (test) | ANLS | 72.7 | 49 |
| Table Question Answering | WTQ (test) | Denotation Accuracy | 34.5 | 45 |
| Visual Machine Reading Comprehension | VisualMRC (test) | CIDEr | 228.7 | 18 |
Showing 10 of 14 rows
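
The DocVQA and InfoVQA results above are reported in ANLS (Average Normalized Levenshtein Similarity). As a rough illustration of how that metric is commonly computed, the sketch below scores each prediction against its best-matching reference answer and zeroes out matches whose normalized edit distance reaches 0.5. The lowercasing and whitespace stripping are assumptions about preprocessing, and the function names are hypothetical. It is not the official evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best thresholded similarity.

    `predictions` is a list of strings; `references` is a list of lists of
    acceptable answers, one list per question.
    """
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Similarity counts only when the normalized distance is small.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical predictions and reference answers.
score = anls(["Paris", "kitten"], [["paris"], ["sitting", "mitten"]])
print(round(100 * score, 1))
```

The table appears to report the score on a 0 to 100 scale, so the value returned here is multiplied by 100 before printing.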
