
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

About

Multimodal pre-training with text, layout, and image has recently achieved SOTA performance on visually-rich document understanding tasks, demonstrating the great potential of joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge language barriers in visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUND, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), with manually labeled key-value pairs for each language. Experimental results show that LayoutXLM significantly outperforms the existing SOTA cross-lingual pre-trained models on the XFUND dataset. The pre-trained LayoutXLM model and the XFUND dataset are publicly available at https://aka.ms/layoutxlm.

Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei • 2021
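Since the pre-trained checkpoint is public, a common way to try it is through the Hugging Face transformers port. The sketch below is a minimal example under that assumption: it loads the "microsoft/layoutxlm-base" checkpoint (which reuses the LayoutLMv2 architecture and needs detectron2 for its visual backbone) and runs one semantic entity recognition step as token classification. The file name, words, boxes, and label ids are illustrative placeholders, not actual XFUND annotations.

```python
# Minimal sketch: LayoutXLM via the Hugging Face transformers port, treating
# semantic entity recognition as token classification. Assumes the
# "microsoft/layoutxlm-base" checkpoint; inputs below are illustrative only.
from PIL import Image
from transformers import (
    LayoutLMv2FeatureExtractor,
    LayoutLMv2ForTokenClassification,
    LayoutXLMProcessor,
    LayoutXLMTokenizerFast,
)

# apply_ocr=False because we supply our own OCR words and bounding boxes.
feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = LayoutXLMTokenizerFast.from_pretrained("microsoft/layoutxlm-base")
processor = LayoutXLMProcessor(feature_extractor, tokenizer)

# 7 labels = BIO tags for HEADER/QUESTION/ANSWER plus O, as in FUNSD/XFUND SER.
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutxlm-base", num_labels=7
)

image = Image.open("form_page.png").convert("RGB")  # hypothetical scanned form
words = ["Name:", "Yiheng", "Date:", "2021-04-18"]  # OCR tokens (any language)
boxes = [                                           # one box per word,
    [60, 40, 130, 60], [140, 40, 220, 60],          # coordinates normalized
    [60, 80, 120, 100], [130, 80, 230, 100],        # to the 0-1000 range
]
word_labels = [1, 2, 1, 2]                          # per-word label ids

# The processor fuses the three modalities the paper pre-trains on:
# tokenized text, 2-D layout (bbox), and the page image.
encoding = processor(
    image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt"
)

outputs = model(**encoding)  # joint text + layout + image forward pass
print(outputs.loss, outputs.logits.shape)
```

In practice one would fine-tune this head on XFUND (or FUNSD) before reading off entity predictions; the forward pass above just shows how the multimodal inputs are wired together.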

Related benchmarks

Task | Dataset | Metric | Result | Rank
Document Classification | RVL-CDIP (test) | Accuracy | 95.21 | 306
Entity Extraction | FUNSD (test) | Entity F1 Score | 79.4 | 104
Semantic Entity Recognition | CORD | F1 Score | 94.81 | 55
Entity Linking | FUNSD (test) | F1 Score | 54.83 | 42
Semantic Entity Recognition | FUNSD (test) | F1 Score | 80.34 | 37
Semantic Entity Recognition | FUNSD | EN Score | 79.4 | 31
Relation Extraction | FUNSD | EN Performance Score | 66.71 | 16
Pair Extraction | RFUND-EN (test) | F1 Score | 52.98 | 16
Relation Extraction | XFUND v1.0 (test) | FUNSD Score | 0.6404 | 12
Semantic Entity Recognition | XFUND v1.0 (test) | FUNSD Score | 82.25 | 12

Showing 10 of 21 rows.

Other info

Code
