Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

About

We propose SeRum (SElective Region Understanding Model), a novel end-to-end document understanding model for extracting meaningful information from document images, with applications in document analysis, retrieval, and office automation. Unlike state-of-the-art approaches that rely on multi-stage technical schemes and are computationally expensive, SeRum converts document image understanding and recognition tasks into a local decoding process over the visual tokens of interest, using a content-aware token merge module. This mechanism lets the model pay more attention to the regions of interest generated by the query decoder, which improves effectiveness and speeds up decoding in the generative scheme. We also design several pre-training tasks to enhance the model's understanding and local awareness. Experimental results demonstrate that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks. SeRum represents a substantial advancement towards efficient and effective end-to-end document understanding.
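
To make the idea of concentrating on visual tokens of interest more concrete, here is a minimal sketch of one possible token-merging step. It is an illustration under assumptions, not SeRum's actual module: the function name `merge_tokens`, the use of a per-token relevance score (standing in for attention from the query decoder), the top-k selection, and the cosine-similarity averaging are all hypothetical choices made for this example.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_tokens(tokens, scores, keep):
    """Hypothetical merge step: keep the `keep` highest-scoring tokens and
    fold every other token into its most similar kept token by averaging.

    tokens: list of feature vectors (lists of floats), one per visual token
    scores: list of relevance scores, one per token (assumed to come from
            some query-driven attention map)
    keep:   number of tokens to retain
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank token indices by relevance, highest first.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])      # restore original (spatial) order
    dropped = order[keep:]

    # Each kept token starts a group containing only itself.
    groups = {i: [tokens[i]] for i in kept}

    # Assign each dropped token to the kept token it resembles most.
    for j in dropped:
        target = max(kept, key=lambda i: cosine(tokens[i], tokens[j]))
        groups[target].append(tokens[j])

    # Represent each group by the element-wise mean of its members.
    return [
        [sum(column) / len(groups[i]) for column in zip(*groups[i])]
        for i in kept
    ]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 0.8]]
    scores = [0.9, 0.8, 0.1, 0.2]
    print(merge_tokens(tokens, scores, keep=2))
```

The decoder would then attend over the shorter merged sequence rather than all visual tokens, which is the source of the speed-up the abstract describes. How SeRum actually scores and merges tokens is specified in the paper, not here.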

Haoyu Cao, Changcun Bao, Chaohu Liu, Huang Chen, Kun Yin, Hao Liu, Yinsong Liu, Deqiang Jiang, Xing Sun • 2023

Related benchmarks

Task                                      Dataset    Metric    Result  Rank
Visual Question Answering                 TextVQA    Accuracy  66.3    1117
Visual Question Answering                 ChartQA    Accuracy  47.9    239
Visual Question Answering                 DocVQA     ANLS      71.9    32
Information Extraction                    CORD       F1 Score  84.9    18
Information Extraction                    SROIE      F1 Score  85.8    16
Text-oriented Visual Question Answering   KLC        F1 Score  31.3    8
Text-oriented Visual Question Answering   TextCaps   CIDEr     101.4   7
Text-oriented Visual Question Answering   InfoVQA    ANLS      13.5    7
Text-oriented Visual Question Answering   DeepForm   F1        50.7    6
Text-oriented Visual Question Answering   WTQ        Accuracy  25.5    6
Showing 10 of 12 rows
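
For readers unfamiliar with the ANLS metric listed for DocVQA and InfoVQA, the following is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly defined, with the usual threshold of 0.5. The function names, the lowercasing and whitespace stripping, and the input layout (one prediction string and a list of accepted answers per question) are assumptions of this example, not the official evaluation script. The sketch returns a value in [0, 1]; the results listed here appear to be on a 0 to 100 scale.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                       # deletion
                cur[j - 1] + 1,                    # insertion
                prev[j - 1] + (char_a != char_b),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: list of predicted answer strings, one per question
    references:  list of lists of accepted answers, one list per question
    tau:         normalized distances at or above this threshold score 0
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < tau else 0.0
            best = max(best, score)  # credit the closest accepted answer
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    print(anls(["kitten"], [["sitten", "mitten"]]))
```

The threshold means that a prediction far from every accepted answer earns no partial credit, while small recognition errors are penalized only in proportion to their edit distance.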
