End-to-end Document Recognition and Understanding with Dessurt
About
We introduce Dessurt, a relatively simple document understanding transformer capable of being fine-tuned on a greater variety of document tasks than prior methods. It receives a document image and task string as input and generates arbitrary text autoregressively as output. Because Dessurt is an end-to-end architecture that performs text recognition in addition to the document understanding, it does not require an external recognition model as prior methods do. Dessurt is a more flexible model than prior methods and is able to handle a variety of document domains and tasks. We show that this model is effective at 9 different dataset-task combinations.
Brian Davis, Bryan Morse, Bryan Price, Chris Tensmeyer, Curtis Wigington, Vlad Morariu • 2022
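The input/output contract described in the abstract (document image plus task string in, autoregressively generated text out) can be illustrated with a small greedy decoding loop. This is only a sketch under stated assumptions: `greedy_generate`, `toy_step`, the `EOS` token, and the task strings are hypothetical names invented for illustration, and the step function is a canned stand-in, not Dessurt's network or its actual API.

```python
from typing import Callable, List, Sequence

EOS = "<eos>"  # hypothetical end-of-sequence token, not taken from the paper

# A step function stands in for the real model: given the image, the task
# string, and the tokens generated so far, it returns the next token.
StepFn = Callable[[object, str, Sequence[str]], str]


def greedy_generate(step: StepFn, image: object, task: str,
                    max_len: int = 32) -> List[str]:
    """Generate tokens one at a time until EOS or max_len is reached."""
    tokens: List[str] = []
    for _ in range(max_len):
        next_token = step(image, task, tokens)
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens


def toy_step(image: object, task: str, prefix: Sequence[str]) -> str:
    """Canned stand-in for a model: ignores the image, replays a fixed answer."""
    canned = {
        "classify": ["letter"],       # hypothetical classification task string
        "read": ["hello", "world"],   # hypothetical recognition task string
    }
    answer = canned.get(task, [])
    return answer[len(prefix)] if len(prefix) < len(answer) else EOS


if __name__ == "__main__":
    print(greedy_generate(toy_step, image=None, task="read"))
    print(greedy_generate(toy_step, image=None, task="classify"))
```

The point of the sketch is that a single generation loop serves every task: only the task string changes, which is what lets one model cover classification, question answering, and recognition without task-specific output heads.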
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Document Classification | RVL-CDIP (test) | Accuracy | 93.6 | 306 |
| Document Visual Question Answering | DocVQA (test) | ANLS | 63.2 | 192 |
| Document Visual Question Answering | DocVQA | ANLS | 63.2 | 164 |
| Form Understanding | FUNSD (test) | F1 Score | 65 | 73 |
| Visual Question Answering | DocVQA (val) | ANLS | 46.5 | 31 |
| Handwriting Recognition | IAM page paragraph | CER | 4.8 | 6 |
| Named Entity Recognition (18 classes) | IAM (RWTH) | Macro F1 | 40.4 | 5 |
| Named Entity Recognition (18 classes) | IAM (Custom) | Macro F1 | 0.485 | 5 |
| Named Entity Recognition (6 classes) | IAM (RWTH) | Macro F1 | 0.62 | 5 |
| Named Entity Recognition (6 classes) | IAM (Custom) | Macro F1 | 71.5 | 5 |
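Two of the metrics in the table, ANLS (Average Normalized Levenshtein Similarity) and CER (character error rate), are both built on edit distance. The sketch below shows how they are commonly defined, assuming the usual ANLS threshold of 0.5 and case-insensitive matching. The function names are illustrative, and this is not the official evaluation code behind the numbers above. The functions return fractions in [0, 1], whereas the table appears to report these metrics as percentages.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def anls(predictions: Sequence[str], gold_answers: Sequence[List[str]],
         tau: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any gold answer."""
    scores = []
    for pred, golds in zip(predictions, gold_answers):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Answers whose normalized distance reaches tau score zero.
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    print(cer("abcd", "abxd"))
    print(anls(["abcd", "xyz"], [["abxd"], ["abc"]]))
```

The threshold in ANLS gives partial credit for answers with small recognition errors while scoring clearly wrong answers as zero, which suits a model that reads text from pixels rather than copying it from an external OCR output.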