DocFormer: End-to-End Transformer for Document Understanding

About

We present DocFormer -- a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on four different datasets, each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in number of parameters).
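
To make the idea of shared spatial embeddings concrete, here is a toy sketch. It is not DocFormer's actual layer: the table sizes, the bucketed coordinates, the feature values, and the function names `spatial_embedding` and `add_position` are all assumptions made for illustration, and random numbers stand in for learned weights. It only shows that when a text token and a visual token at the same page location draw their position vector from the same table, they receive an identical positional component.

```python
import random

DIM = 4    # embedding width (hypothetical, kept tiny for readability)
GRID = 8   # number of coarse position buckets per axis (hypothetical)

rng = random.Random(0)
# One spatial table per axis, shared by BOTH modalities.
# Random values stand in for weights that would be learned in training.
X_TABLE = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(GRID)]
Y_TABLE = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(GRID)]


def spatial_embedding(x_bucket, y_bucket):
    """Look up the shared position vector for a token's grid cell."""
    return [a + b for a, b in zip(X_TABLE[x_bucket], Y_TABLE[y_bucket])]


def add_position(features, x_bucket, y_bucket):
    """Add the shared spatial vector to a text or visual feature vector."""
    pos = spatial_embedding(x_bucket, y_bucket)
    return [f + p for f, p in zip(features, pos)]


# Made-up features for a text token and a visual token at the same location.
text_feat = [0.1, 0.2, 0.3, 0.4]
visual_feat = [0.5, 0.1, 0.0, 0.2]

text_tok = add_position(text_feat, 2, 5)
visual_tok = add_position(visual_feat, 2, 5)

# Recover the positional component that each token received.
text_pos = [t - f for t, f in zip(text_tok, text_feat)]
visual_pos = [v - f for v, f in zip(visual_tok, visual_feat)]
```

In this sketch `text_pos` and `visual_pos` agree up to floating-point rounding, which is the property that lets a model line up tokens from the two modalities by location.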

Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha • 2021

Related benchmarks

Task                                Dataset               Metric           Result  Rank
Document Classification             RVL-CDIP (test)       Accuracy         96.2    306
Document Visual Question Answering  DocVQA (test)         ANLS             78.78   192
Information Extraction              CORD (test)           F1 Score         96.99   133
Entity extraction                   FUNSD (test)          Entity F1 Score  84.55   104
Form Understanding                  FUNSD (test)          F1 Score         84.55   73
Information Extraction              FUNSD (test)          F1 Score         84.55   55
Semantic Entity Recognition         CORD                  F1 Score         96.99   55
Entity recognition                  CORD official (test)  F1 Score         96.99   37
Semantic Entity Recognition         FUNSD (test)          F1 Score         83.34   37
Semantic Entity Recognition         FUNSD                 EN Score         84.55   31
Showing 10 of 18 rows
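
For readers unfamiliar with the ANLS metric reported for DocVQA, the following is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly defined, with a threshold of 0.5. The function names, the lowercasing and whitespace stripping, and the sample strings are assumptions for illustration; this is not the benchmark's official scoring script. Scores fall in [0, 1], and leaderboards often show them scaled by 100.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """Average over questions of the best similarity to any accepted answer.

    predictions: one predicted string per question.
    ground_truths: one list of accepted answer strings per question.
    """
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(a), len(p))
            nl = levenshtein(a, p) / longest if longest else 0.0
            # Similarities at or beyond the threshold distance count as 0.
            sim = 1.0 - nl if nl < threshold else 0.0
            best = max(best, sim)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

A call such as `anls(["docformer"], [["DocFormer", "Doc Former"]])` scores the prediction against every accepted answer and keeps the best match, so small OCR-style slips are penalized softly while unrelated answers score zero.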
