LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
About
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture the cross-modality interaction in the pre-training stage. It also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). We have made our model and code publicly available at https://aka.ms/layoutlmv2.
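The abstract names a spatial-aware self-attention mechanism but does not spell out its form. The sketch below is one way to read it, assuming the spatial awareness enters as additive biases on the attention logits, indexed by the clipped relative offset between tokens in sequence order, in x, and in y. The function name `spatial_aware_attention`, the `clip` helper, the bias tables, and the toy coordinates are all hypothetical and chosen for illustration; they are not taken from the released code.

```python
import math

def clip(value, limit):
    """Clamp a relative offset to [-limit, limit] so the bias tables stay small."""
    return max(-limit, min(limit, value))

def spatial_aware_attention(scores, boxes, bias_1d, bias_x, bias_y, max_rel=2):
    """Add relative-position biases to raw attention logits, then softmax each row.

    scores : n x n list of raw attention logits (query i, key j)
    boxes  : one (x, y) integer coordinate per token (assumed already bucketed)
    bias_* : dicts mapping a clipped relative offset to a scalar bias
             (stand-ins for learned parameters; an assumption of this sketch)
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            bias = (bias_1d[clip(j - i, max_rel)]
                    + bias_x[clip(boxes[j][0] - boxes[i][0], max_rel)]
                    + bias_y[clip(boxes[j][1] - boxes[i][1], max_rel)])
            row.append(scores[i][j] + bias)
        # Numerically stable softmax over the biased logits.
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy example: three tokens, the first two on the same text line (y = 0).
offsets = range(-2, 3)
no_bias = {d: 0.0 for d in offsets}
same_line_bias = {d: (1.0 if d == 0 else 0.0) for d in offsets}

weights = spatial_aware_attention(
    scores=[[0.0] * 3 for _ in range(3)],
    boxes=[(0, 0), (1, 0), (0, 1)],
    bias_1d=no_bias,
    bias_x=no_bias,
    bias_y=same_line_bias,
)
```

With all content logits set to zero, the only signal is the y-offset bias, so the first token attends more strongly to its same-line neighbour than to the token on the next line. In a trained model the bias tables would be learned per attention head rather than set by hand.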
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Document Classification | RVL-CDIP (test) | Accuracy | 95.65 | 306 |
| Document Visual Question Answering | DocVQA (test) | ANLS | 86.72 | 192 |
| Information Extraction | CORD (test) | F1 Score | 97.24 | 133 |
| Entity extraction | FUNSD (test) | Entity F1 Score | 84.2 | 104 |
| Form Understanding | FUNSD (test) | F1 Score | 84.2 | 73 |
| Information Extraction | SROIE (test) | F1 Score | 97.81 | 58 |
| Information Extraction | FUNSD (test) | F1 Score | 84.2 | 55 |
| Semantic Entity Recognition | CORD | F1 Score | 96.01 | 55 |
| Entity Linking | FUNSD (test) | F1 Score | 70.57 | 42 |
| Semantic Entity Recognition | FUNSD (test) | F1 Score | 84.2 | 37 |