LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
About
Multimodal pre-training with text, layout, and image has recently achieved SOTA performance on visually-rich document understanding tasks, demonstrating the great potential of joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUND, which includes manually labeled form understanding samples with key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). Experimental results show that LayoutXLM significantly outperforms existing SOTA cross-lingual pre-trained models on the XFUND dataset. The pre-trained LayoutXLM model and the XFUND dataset are publicly available at https://aka.ms/layoutxlm.
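As a quick orientation, the sketch below shows one plausible way to run the released model for semantic entity recognition via the Hugging Face `transformers` library. The checkpoint name `microsoft/layoutxlm-base` is the public Hugging Face release; the XFUND-style BIO label set and the input file `form.png` are illustrative assumptions, not prescribed by the paper.

```python
# A minimal sketch, assuming the Hugging Face checkpoint
# "microsoft/layoutxlm-base" and an XFUND-style BIO label set.
# Requires `transformers`, `pytesseract` (for the built-in OCR) and
# `detectron2` (the LayoutLMv2 visual backbone that LayoutXLM reuses).
from PIL import Image
from transformers import LayoutXLMProcessor, LayoutLMv2ForTokenClassification

# XFUND-style semantic entity labels (assumption, for illustration).
LABELS = ["O", "B-HEADER", "I-HEADER",
          "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER"]

processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutxlm-base", num_labels=len(LABELS)
)

# "form.png" is a hypothetical scanned form; the processor OCRs the
# page to recover words and bounding boxes, tokenizes the text, and
# resizes the image for the visual branch.
image = Image.open("form.png").convert("RGB")
encoding = processor(image, return_tensors="pt")

outputs = model(**encoding)  # fuses text, layout (bbox) and image features
predicted_ids = outputs.logits.argmax(-1).squeeze().tolist()
print([LABELS[i] for i in predicted_ids])
```

Note that the base checkpoint's token classification head is randomly initialized, so in practice one would fine-tune on XFUND (or load an already fine-tuned checkpoint) before trusting the decoded labels.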
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Document Classification | RVL-CDIP (test) | Accuracy | 95.21 | 306 |
| Entity Extraction | FUNSD (test) | Entity F1 Score | 79.4 | 104 |
| Semantic Entity Recognition | CORD | F1 Score | 94.81 | 55 |
| Entity Linking | FUNSD (test) | F1 Score | 54.83 | 42 |
| Semantic Entity Recognition | FUNSD (test) | F1 Score | 80.34 | 37 |
| Semantic Entity Recognition | FUNSD | EN Score | 79.4 | 31 |
| Relation Extraction | FUNSD | EN Performance Score | 66.71 | 16 |
| Pair Extraction | RFUND-EN (test) | F1 Score | 52.98 | 16 |
| Relation Extraction | XFUND v1.0 (test) | FUNSD Score | 0.6404 | 12 |
| Semantic Entity Recognition | XFUND v1.0 (test) | FUNSD Score | 82.25 | 12 |
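Most rows above report entity-level (not token-level) F1: a predicted entity counts as correct only if both its span and its type exactly match the gold annotation. The sketch below illustrates that scoring convention with the `seqeval` library on made-up BIO tag sequences; the leaderboards do not specify their exact scorer, so this is an assumption about the convention, not the official evaluation code.

```python
# Sketch of entity-level F1 on BIO-tagged sequences with `seqeval`.
# The tag sequences here are illustrative, not taken from any dataset.
from seqeval.metrics import f1_score

gold = [["B-QUESTION", "I-QUESTION", "O", "B-ANSWER", "I-ANSWER"]]
pred = [["B-QUESTION", "I-QUESTION", "O", "B-ANSWER", "O"]]

# Gold has two entities (QUESTION, ANSWER); the prediction recovers the
# QUESTION span exactly but truncates the ANSWER span, which scores zero.
print(f1_score(gold, pred))  # 0.5: one of two gold entities recovered
```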