DocFormer: End-to-End Transformer for Document Understanding
About
We present DocFormer, a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem that aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks that encourage multi-modal interaction. It uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text tokens with visual tokens and vice versa. DocFormer is evaluated on four different datasets, each with strong baselines, and achieves state-of-the-art results on all of them, sometimes beating models 4x its size in number of parameters.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Document Classification | RVL-CDIP (test) | Accuracy | 96.2 | 306 |
| Document Visual Question Answering | DocVQA (test) | ANLS | 78.78 | 192 |
| Information Extraction | CORD (test) | F1 Score | 96.99 | 133 |
| Entity extraction | FUNSD (test) | Entity F1 Score | 84.55 | 104 |
| Form Understanding | FUNSD (test) | F1 Score | 84.55 | 73 |
| Information Extraction | FUNSD (test) | F1 Score | 84.55 | 55 |
| Semantic Entity Recognition | CORD | F1 Score | 96.99 | 55 |
| Entity recognition | CORD official (test) | F1 Score | 96.99 | 37 |
| Semantic Entity Recognition | FUNSD (test) | F1 Score | 83.34 | 37 |
| Semantic Entity Recognition | FUNSD | EN Score | 84.55 | 31 |
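
The DocVQA row reports ANLS (Average Normalized Levenshtein Similarity), which is less familiar than accuracy or F1. The sketch below shows how such a score is commonly computed, assuming the usual threshold of 0.5 and case-insensitive, whitespace-trimmed comparison. The function names `levenshtein` and `anls` are illustrative, and this is not the official evaluation script used for the leaderboard.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average, over questions, of the best similarity to any accepted answer.

    Similarity is 1 - normalized edit distance, and is set to 0 when the
    normalized distance reaches the threshold (assumed here to be 0.5).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            ans = answer.strip().lower()
            longest = max(len(pred), len(ans))
            distance = levenshtein(pred, ans) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    preds = ["Invoice", "1234"]
    refs = [["invoice"], ["1235", "12-35"]]
    print(anls(preds, refs))
```

The function returns a value in [0, 1]; the table entry of 78.78 appears to be the same quantity expressed as a percentage. The threshold means that an answer that is mostly wrong earns no partial credit, while small OCR-style character errors are penalized only lightly.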