Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
About
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
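The key architectural idea above is that layout enters the model as an additive attention bias rather than as extra input tokens. Below is a minimal sketch of that mechanism, assuming simple distance-based horizontal and vertical biases; the function name, the toy coordinates, and the exact bias form are illustrative assumptions, not the authors' implementation (TILT uses learned, bucketed relative biases per attention head).

```python
import numpy as np

def attention_with_layout_bias(q, k, v, h_bias, v_bias):
    """Scaled dot-product attention with pairwise layout biases
    (horizontal and vertical) added to the raw scores, mirroring
    the idea of representing layout as an attention bias."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # (n, n) token-pair scores
    scores = scores + h_bias + v_bias   # layout enters as an additive bias
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy example: 3 tokens with 4-dim heads; biases favor spatially close pairs.
rng = np.random.default_rng(0)
n, d = 3, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
xs = np.array([0.0, 1.0, 5.0])  # hypothetical token x-coordinates on the page
ys = np.array([0.0, 0.0, 2.0])  # hypothetical token y-coordinates
h_bias = -np.abs(xs[:, None] - xs[None, :])  # nearby-in-x pairs score higher
v_bias = -np.abs(ys[:, None] - ys[None, :])
out = attention_with_layout_bias(q, k, v, h_bias, v_bias)
print(out.shape)  # (3, 4)
```

In the full model this bias is computed per head from bucketed relative positions and combined with contextualized visual features, but the additive form shown here is the core of how spatial structure reaches the Transformer.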
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Document Classification | RVL-CDIP (test) | Accuracy | 95.52 | 306 |
| Document Visual Question Answering | DocVQA (test) | ANLS | 87.05 | 192 |
| Information Extraction | CORD (test) | F1 Score | 96.33 | 133 |
| Information Extraction | SROIE (test) | F1 Score | 98.1 | 58 |
| Visual Question Answering | ChartQA (test) | Accuracy | 70.4 | 58 |
| Semantic Entity Recognition | CORD | F1 Score | 95.11 | 55 |
| Document Question Answering | DocVQA | ANLS | 87.05 | 52 |
| Entity Recognition | CORD official (test) | F1 Score | 96.33 | 37 |
| Visual Question Answering | DocVQA | ANLS | 87.1 | 32 |
| Document Image Classification | RVL-CDIP 1.0 (test) | Accuracy | 95.52 | 25 |