DocFormerv2: Local Features for Document Understanding

About

We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents beyond mere OCR predictions, e.g., extracting information from a form, answering questions about documents (VQA) and other tasks. VDU is challenging because a model must make sense of multiple modalities (visual, language and spatial) to make a prediction. Our approach, termed DocFormerv2, is an encoder-decoder transformer that takes vision, language and spatial features as input. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically, i.e., two novel document tasks on the encoder and one on the auto-regressive decoder. The unsupervised tasks have been carefully designed so that pre-training encourages local-feature alignment between multiple modalities. Evaluated on nine datasets, DocFormerv2 shows state-of-the-art performance over strong baselines, e.g., TabFact (4.3%), InfoVQA (1.4%) and FUNSD (1%). Furthermore, to show generalization capabilities, on three VQA tasks involving scene-text, DocFormerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLI and Flamingo) on some tasks. Extensive ablations show that, due to its pre-training, DocFormerv2 understands multiple modalities better than prior art in VDU.

Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, R. Manmatha • 2023

Related benchmarks

Task                                   Dataset           Metric     Result  Rank
Visual Question Answering              TextVQA (val)     VQA Score  65.6    309
Document Visual Question Answering     DocVQA (test)     ANLS       87.84   192
Information Extraction                 CORD (test)       F1 Score   97.7    133
Visual Question Answering              TextVQA (test)    Accuracy   64      124
Information Visual Question Answering  InfoVQA (test)    ANLS       48.8    92
Visual Question Answering              OCR-VQA (test)    Accuracy   71.5    77
Form Understanding                     FUNSD (test)      F1 Score   88.89   73
Visual Question Answering              OCR-VQA (val)     Accuracy   71.1    17
Scene Text Visual Question Answering   ST-VQA 1.0 (val)  ANLS       72.9    15
Table Visual Question Answering        TabFact (test)    Accuracy   0.832   15
Showing 10 of 11 rows
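
Several results above are reported as ANLS (Average Normalized Levenshtein Similarity). For readers unfamiliar with the metric, here is a minimal sketch of how it is commonly computed. It assumes the usual definition with a 0.5 threshold and case-insensitive, whitespace-stripped comparison. The function names `levenshtein` and `anls` are illustrative, and this is not the evaluation code used for DocFormerv2. The sketch returns a score in [0, 1], whereas the ANLS values listed above appear to be on a 0-100 scale.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity.

    Assumption: threshold 0.5 and lowercase/strip normalization,
    as in the commonly used definition of the metric.
    """
    scores = []
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Similarities at or beyond the threshold distance count as 0.
            sim = 1.0 - nl if nl < threshold else 0.0
            best = max(best, sim)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

The threshold means a prediction that is mostly wrong earns no partial credit, while small OCR-style slips (one or two characters off) are only lightly penalized.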
