StructuralLM: Structural Pre-training for Form Understanding
About
Large pre-trained language models achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, they almost exclusively focus on text-only representations, neglecting the cell-level layout information that is important for form image understanding. In this paper, we propose a new pre-training approach, StructuralLM, to jointly leverage cell and layout information from scanned documents. Specifically, we pre-train StructuralLM with two new designs to make the most of the interactions between cell and layout information: 1) treating each cell as a semantic unit; 2) classification of cell positions. The pre-trained StructuralLM achieves new state-of-the-art results on several types of downstream tasks, including form understanding (F1 from 78.95 to 85.14), document visual question answering (ANLS from 72.59 to 83.94), and document image classification (accuracy from 94.43 to 96.08).
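The second design above, cell-position classification, can be sketched as a simple labeling function. This is a minimal illustration, not the paper's implementation: it assumes (hypothetically) that the page is split into an N × N grid of areas and that a cell's position label is the area containing its bounding-box center.

```python
# Hypothetical sketch of the "cell position classification" pre-training
# signal: the page is divided into an n x n grid and each cell is labeled
# with the id of the grid area containing its center. The grid layout and
# labeling scheme here are illustrative assumptions, not the paper's spec.

def cell_position_label(bbox, page_w, page_h, n=4):
    """bbox = (x0, y0, x1, y1) in page coordinates.

    Returns an integer area id in [0, n*n), row-major."""
    x0, y0, x1, y1 = bbox
    cx = (x0 + x1) / 2 / page_w   # normalized center x in [0, 1]
    cy = (y0 + y1) / 2 / page_h   # normalized center y in [0, 1]
    col = min(int(cx * n), n - 1)  # clamp cells touching the right edge
    row = min(int(cy * n), n - 1)  # clamp cells touching the bottom edge
    return row * n + col

# A cell near the top-left of a 1000x1000 page falls in area 0;
# one near the bottom-right falls in area n*n - 1.
print(cell_position_label((10, 10, 110, 40), 1000, 1000))      # → 0
print(cell_position_label((900, 900, 990, 990), 1000, 1000))   # → 15
```

During pre-training, the model would be asked to predict this area id for each cell, forcing its representations to encode where a cell sits on the page.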
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Document Classification | RVL-CDIP (test) | Accuracy | 96.08 | 306 |
| Document Visual Question Answering | DocVQA (test) | ANLS | 83.94 | 192 |
| Entity Extraction | FUNSD (test) | Entity F1 Score | 85.14 | 104 |
| Form Understanding | FUNSD (test) | F1 Score | 85.14 | 73 |
| Information Extraction | FUNSD (test) | F1 Score | 85.14 | 55 |
| Document Question Answering | DocVQA | ANLS | 83.49 | 52 |
| Semantic Entity Recognition | FUNSD | -- | -- | 31 |
| Document Image Classification | RVL-CDIP 1.0 (test) | Accuracy | 96.08 | 25 |
| Document Understanding | DUE Benchmark | DocVQA ANLS | 83.9 | 24 |
| Information Extraction | FUNSD v1 (test) | F1 Score | 85.14 | 13 |