XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding

About

Recently, various multimodal networks for Visually-Rich Document Understanding (VRDU) have been proposed, showing that transformers improve when visual and layout information is integrated with the text embeddings. However, most existing approaches use position embeddings to incorporate sequence information, neglecting the noisy, improper reading orders produced by OCR tools. In this paper, we propose a robust layout-aware multimodal network named XYLayoutLM that captures and leverages rich layout information from the proper reading orders produced by our Augmented XY Cut. Moreover, a Dilated Conditional Position Encoding module is proposed to handle input sequences of variable length; it additionally extracts local layout information from both the textual and visual modalities while generating position embeddings. Experimental results show that XYLayoutLM achieves competitive results on document understanding tasks.
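
To make the reading-order step concrete, below is a minimal, hypothetical sketch of a plain recursive XY cut over OCR bounding boxes. It is not the paper's Augmented XY Cut (which extends the basic cut to produce more robust reading orders on noisy layouts); the function names (`xy_cut`, `_gaps`) and the box format `(x0, y0, x1, y1)` are assumptions made only for illustration.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def _gaps(intervals: List[Tuple[float, float]]) -> List[float]:
    """Return cut positions located in whitespace gaps of a 1-D projection."""
    intervals = sorted(intervals)
    cuts, reach = [], intervals[0][1]
    for lo, hi in intervals[1:]:
        if lo > reach:                      # a gap in the projection profile
            cuts.append((reach + lo) / 2)
        reach = max(reach, hi)
    return cuts


def xy_cut(boxes: List[Box], ids: Optional[List[int]] = None) -> List[int]:
    """Recursively split boxes at y-gaps, then x-gaps, to get a reading order."""
    ids = list(range(len(boxes))) if ids is None else ids
    if len(ids) <= 1:
        return ids

    # 1) Horizontal cuts (gaps along y): split into top-to-bottom bands.
    y_cuts = _gaps([(boxes[i][1], boxes[i][3]) for i in ids])
    if y_cuts:
        bands = [[] for _ in range(len(y_cuts) + 1)]
        for i in ids:
            bands[sum(boxes[i][1] >= c for c in y_cuts)].append(i)
        return [j for band in bands for j in xy_cut(boxes, band)]

    # 2) Otherwise vertical cuts (gaps along x): left-to-right columns.
    x_cuts = _gaps([(boxes[i][0], boxes[i][2]) for i in ids])
    if x_cuts:
        cols = [[] for _ in range(len(x_cuts) + 1)]
        for i in ids:
            cols[sum(boxes[i][0] >= c for c in x_cuts)].append(i)
        return [j for col in cols for j in xy_cut(boxes, col)]

    # 3) No clean cut: fall back to sorting by (y0, x0).
    return sorted(ids, key=lambda i: (boxes[i][1], boxes[i][0]))
```

The returned index order would be used to reorder the OCR tokens before they are fed to the transformer, so position embeddings follow a sensible reading order rather than raw OCR output order.

Similarly, the Dilated Conditional Position Encoding idea can be pictured as a convolution that generates position information from the tokens themselves, so no fixed-length embedding table is needed. The sketch below is a simplified, one-dimensional PyTorch module under assumed hyperparameters (kernel size 3, dilation 2); the class name is hypothetical, and the paper's module also covers the visual modality and may differ in structure.

```python
import torch
from torch import nn


class DilatedCPE1D(nn.Module):
    """Sketch: a depth-wise dilated 1-D conv that generates position encodings
    conditioned on local token content, supporting any sequence length."""

    def __init__(self, dim: int, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2  # keeps sequence length unchanged
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=padding, dilation=dilation, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); Conv1d expects (batch, dim, seq_len)
        pos = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return x + pos  # add the content-conditioned positional signal
```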

Zhangxuan Gu, Changhua Meng, Ke Wang, Jun Lan, Weiqiang Wang, Ming Gu, Liqing Zhang • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Entity extraction | FUNSD (test) | Entity F1 Score | 83.35 | 104 |
| Form Understanding | FUNSD (test) | F1 Score | 83.35 | 73 |
| Semantic Entity Recognition | FUNSD | -- | -- | 31 |
| Entity extraction | XFUND (test) | F1 Score | 91.76 | 9 |
| Entity Linking | XFUND (test) | F1 Score | 67.79 | 8 |
| Relation Extraction | XFUN v1 (test) | Avg. F1 | 67.79 | 5 |
| Semantic Entity Recognition | XFUN v1 (test) | XFUN Avg. F1 | 82.04 | 5 |
| Key Information Extraction | XFUND zh | SER Hmean | 91.76 | 5 |
