OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

About

Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However, due to the diversified targets and heterogeneous schemas, previous works usually design task-specific architectures and objectives for individual tasks, which inadvertently leads to modal isolation and complex workflow. In this paper, we propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks: text spotting, key information extraction, and table recognition. In OmniParser, all tasks share the unified encoder-decoder architecture, the unified objective: point-conditioned text generation, and the unified input & output representation: prompt & structured sequences. Extensive experiments demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or highly competitive performances on 7 datasets for the three visually-situated text parsing tasks, despite its unified, concise design. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.

Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, Zhibo Yang • 2024

Related benchmarks

Task	Dataset	Result
Text Detection	ICDAR 2015	Precision 90.3	188
Text Detection	Total-Text	Precision 88.4	160
End-to-End Text Spotting	ICDAR 2015	Strong Score 89.6	104
Text Detection	CTW1500	F-measure 87.8	98
End-to-End Scene Text Spotting	Total-Text	Hmean (None) 84	80
Table Recognition	PubTabNet (test)	TEDS (Simple) 90.5	70
Text Spotting	CTW1500	E2E Score (None) 66.8	41
Table Structure Recognition	FinTabNet	S-TEDS 91.55	40
GUI Navigation	Multimodal-Mind2Web Cross-Website	Step Success Rate 36.5	37
Table Structure Recognition	PubTabNet	S-TEDS 91.55	37

Showing 10 of 18 rows
