DocFormerv2: Local Features for Document Understanding
About
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). VDU entails understanding documents beyond mere OCR predictions, e.g., extracting information from a form, answering visual questions about a document, and other tasks. VDU is challenging because a model must make sense of multiple modalities (visual, language, and spatial) to make a prediction. Our approach, termed DocFormerv2, is an encoder-decoder transformer that takes vision, language, and spatial features as input. DocFormerv2 is pre-trained with unsupervised tasks applied asymmetrically: two novel document tasks on the encoder and one on the auto-regressive decoder. These unsupervised tasks are carefully designed so that pre-training encourages local-feature alignment between modalities. Evaluated on nine datasets, DocFormerv2 achieves state-of-the-art performance over strong baselines, e.g., TabFact (+4.3%), InfoVQA (+1.4%), FUNSD (+1%). Furthermore, demonstrating its generalization, on three VQA tasks involving scene text, DocFormerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLi, and Flamingo) on some tasks. Extensive ablations show that, due to its pre-training, DocFormerv2 understands multiple modalities better than prior art in VDU.
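The abstract describes the encoder input as a fusion of language, spatial, and visual features. As a rough illustration only (this is not the paper's actual implementation; the dimensions, embedding tables, and simple additive fusion below are all assumptions), one common way to combine such modalities is to embed each per OCR token and sum them:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D = 1000, 64  # hypothetical vocabulary size and embedding dimension

# Hypothetical learned embedding tables (randomly initialized here).
# Box coordinates are assumed normalized to the range 0..1000.
token_table = rng.normal(size=(VOCAB, D))
x_table = rng.normal(size=(1001, D))
y_table = rng.normal(size=(1001, D))

def embed_document(token_ids, boxes, visual_feats):
    """Fuse language, spatial, and visual features into one encoder input.

    token_ids:    (T,) int array of OCR token ids
    boxes:        (T, 4) int array of boxes [x0, y0, x1, y1]
    visual_feats: (T, D) per-token visual features (e.g. pooled image patches)
    """
    lang = token_table[token_ids]                              # (T, D)
    spatial = (x_table[boxes[:, 0]] + x_table[boxes[:, 2]]     # left + right edges
               + y_table[boxes[:, 1]] + y_table[boxes[:, 3]])  # top + bottom edges
    return lang + spatial + visual_feats                       # (T, D)

# Toy example: three OCR tokens with bounding boxes and visual features.
ids = np.array([5, 17, 42])
boxes = np.array([[10, 20, 110, 40],
                  [120, 20, 200, 40],
                  [10, 50, 90, 70]])
vis = rng.normal(size=(3, D))
enc_in = embed_document(ids, boxes, vis)
print(enc_in.shape)  # (3, 64)
```

The fused sequence would then be consumed by a standard encoder-decoder transformer; the asymmetric pre-training objectives described above operate on the encoder and decoder separately.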
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA (val) | VQA Score | 65.6 | 309 |
| Document Visual Question Answering | DocVQA (test) | ANLS | 87.84 | 192 |
| Information Extraction | CORD (test) | F1 Score | 97.7 | 133 |
| Visual Question Answering | TextVQA (test) | Accuracy | 64 | 124 |
| Information Visual Question Answering | InfoVQA (test) | ANLS | 48.8 | 92 |
| Visual Question Answering | OCR-VQA (test) | Accuracy | 71.5 | 77 |
| Form Understanding | FUNSD (test) | F1 Score | 88.89 | 73 |
| Visual Question Answering | OCR-VQA (val) | Accuracy | 71.1 | 17 |
| Scene Text Visual Question Answering | ST-VQA 1.0 (val) | ANLS | 72.9 | 15 |
| Table Visual Question Answering | TabFact (test) | Accuracy | 0.832 | 15 |