LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
About
Multimodal pre-training with text, layout, and image has recently achieved SOTA performance on visually-rich document understanding tasks, demonstrating the great potential of joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUND, which includes manually labeled form understanding samples with key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). Experimental results show that LayoutXLM significantly outperforms existing SOTA cross-lingual pre-trained models on the XFUND dataset. The pre-trained LayoutXLM model and the XFUND dataset are publicly available at https://aka.ms/layoutxlm.
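As a quick orientation, the sketch below shows one plausible way to run the released model for semantic entity recognition via the Hugging Face `transformers` library. The checkpoint name `microsoft/layoutxlm-base` is the public Hugging Face release; the XFUND-style BIO label set and the input file `form.png` are illustrative assumptions, not prescribed by the paper.

```python
# A minimal sketch, assuming the Hugging Face checkpoint
# "microsoft/layoutxlm-base" and an XFUND-style BIO label set.
# Requires `transformers`, `pytesseract` (for the built-in OCR) and
# `detectron2` (the LayoutLMv2 visual backbone that LayoutXLM reuses).
from PIL import Image
from transformers import LayoutXLMProcessor, LayoutLMv2ForTokenClassification

# XFUND-style semantic entity labels (assumption, for illustration).
LABELS = ["O", "B-HEADER", "I-HEADER",
          "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER"]

processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutxlm-base", num_labels=len(LABELS)
)

# "form.png" is a hypothetical scanned form; the processor OCRs the
# page to recover words and bounding boxes, tokenizes the text, and
# resizes the image for the visual branch.
image = Image.open("form.png").convert("RGB")
encoding = processor(image, return_tensors="pt")

outputs = model(**encoding)  # fuses text, layout (bbox) and image features
predicted_ids = outputs.logits.argmax(-1).squeeze().tolist()
print([LABELS[i] for i in predicted_ids])
```

Note that the base checkpoint's token classification head is randomly initialized, so in practice one would fine-tune on XFUND (or load an already fine-tuned checkpoint) before trusting the decoded labels.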
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Document Classification | RVL-CDIP (test) | Accuracy | 95.21 | 306 |
| Entity Extraction | FUNSD (test) | Entity F1 Score | 79.4 | 104 |
| Semantic Entity Recognition | CORD | F1 Score | 94.81 | 55 |
| Entity Linking | FUNSD (test) | F1 Score | 54.83 | 42 |
| Semantic Entity Recognition | FUNSD (test) | F1 Score | 80.34 | 37 |
| Semantic Entity Recognition | FUNSD | EN Score | 79.4 | 31 |
| Relation Extraction | FUNSD | EN Performance Score | 66.71 | 16 |
| Pair Extraction | RFUND-EN (test) | F1 Score | 52.98 | 16 |
| Relation Extraction | XFUND v1.0 (test) | FUNSD Score | 0.6404 | 12 |
| Semantic Entity Recognition | XFUND v1.0 (test) | FUNSD Score | 82.25 | 12 |
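Most rows above report entity-level (not token-level) F1: a predicted entity counts as correct only if both its span and its type exactly match the gold annotation. The sketch below illustrates that scoring convention with the `seqeval` library on made-up BIO tag sequences; the leaderboards do not specify their exact scorer, so this is an assumption about the convention, not the official evaluation code.

```python
# Sketch of entity-level F1 on BIO-tagged sequences with `seqeval`.
# The tag sequences here are illustrative, not taken from any dataset.
from seqeval.metrics import f1_score

gold = [["B-QUESTION", "I-QUESTION", "O", "B-ANSWER", "I-ANSWER"]]
pred = [["B-QUESTION", "I-QUESTION", "O", "B-ANSWER", "O"]]

# Gold has two entities (QUESTION, ANSWER); the prediction recovers the
# QUESTION span exactly but truncates the ANSWER span, which scores zero.
print(f1_score(gold, pred))  # 0.5: one of two gold entities recovered
```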