DocFormer: End-to-End Transformer for Document Understanding

About

We present DocFormer, a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. It uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. It also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text tokens with visual tokens and vice versa. DocFormer is evaluated on four different datasets, each with strong baselines, and achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in number of parameters).

Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha • 2021
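
The abstract's point about sharing spatial embeddings across modalities can be illustrated with a toy sketch. This is not DocFormer's attention layer. It is a simplified illustration that assumes a single table of position vectors is added to both text and visual token features before plain scaled dot-product attention. The names `make_spatial_table`, `add_spatial`, `attention_weights`, `demo` and the size `DIM` are hypothetical, and the position vectors are random unit vectors standing in for learned embeddings.

```python
import math
import random

DIM = 8  # hypothetical feature size, chosen only for illustration


def make_spatial_table(num_positions, dim, seed=0):
    """Build one table of unit-length position vectors.

    Random values stand in for learned spatial embeddings (an assumption).
    """
    rng = random.Random(seed)
    table = []
    for _ in range(num_positions):
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in vec))
        table.append([v / norm for v in vec])
    return table


def add_spatial(features, positions, table):
    """Add the shared position vector for each token to that token's features."""
    return [
        [f + s for f, s in zip(feat, table[pos])]
        for feat, pos in zip(features, positions)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights of each query over all keys."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def demo():
    """Text and visual tokens at the same three positions share one table."""
    table = make_spatial_table(num_positions=3, dim=DIM)
    # Content features are zero so that only the shared spatial term matters.
    zeros = [[0.0] * DIM for _ in range(3)]
    text_tokens = add_spatial(zeros, [0, 1, 2], table)
    visual_tokens = add_spatial(zeros, [0, 1, 2], table)
    return attention_weights(text_tokens, visual_tokens)
```

In this toy setup, each text token places its largest attention weight on the visual token at the same position, because both carry the same unit-length position vector. With real content features the effect would be one signal among several rather than the deciding one.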

Related benchmarks

Task	Dataset	Result
Document Classification	RVL-CDIP (test)	Accuracy 96.2	306
Document Visual Question Answering	DocVQA (test)	ANLS 78.78	292
Information Extraction	CORD (test)	F1 Score 96.99	133
Entity extraction	FUNSD (test)	Entity F1 Score 84.55	104
Form Understanding	FUNSD (test)	F1 Score 84.55	73
Information Extraction	FUNSD (test)	F1 Score 84.55	55
Semantic Entity Recognition	CORD	F1 Score 96.99	55
Entity recognition	CORD official (test)	F1 Score 96.99	37
Semantic Entity Recognition	FUNSD (test)	F1 Score 83.34	37
Semantic Entity Recognition	FUNSD	EN Score 84.55	31

Showing 10 of 18 rows
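
The DocVQA row above reports ANLS (Average Normalized Levenshtein Similarity). As a reading aid, here is a minimal sketch of how that metric is commonly computed. It is not the official evaluation script. The 0.5 threshold and the lowercase-and-strip normalization are assumptions based on the usual description of the metric, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """Average over questions of the best per-answer similarity.

    predictions: one predicted string per question.
    ground_truths: one list of accepted answers per question.
    A similarity counts only if the normalized distance is below `threshold`.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        p = pred.strip().lower()  # assumed normalization
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(p), len(a))
            dist = levenshtein(p, a) / longest if longest else 0.0
            score = 1.0 - dist if dist < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

This function returns a value in [0, 1]. The 78.78 in the table appears to be the same quantity on a 0 to 100 scale.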
