OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

About

Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However, due to the diversified targets and heterogeneous schemas, previous works usually design task-specific architectures and objectives for individual tasks, which inadvertently leads to modal isolation and complex workflow. In this paper, we propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks: text spotting, key information extraction, and table recognition. In OmniParser, all tasks share the unified encoder-decoder architecture, the unified objective: point-conditioned text generation, and the unified input & output representation: prompt & structured sequences. Extensive experiments demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or highly competitive performances on 7 datasets for the three visually-situated text parsing tasks, despite its unified, concise design. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.

Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, Zhibo Yang • 2024

Related benchmarks

Task                            Dataset                            Metric             Result  Rank
Text Detection                  ICDAR 2015                         Precision          90.3    188
Text Detection                  Total-Text                         Precision          88.4    160
End-to-End Text Spotting        ICDAR 2015                         Strong Score       89.6    104
Text Detection                  CTW1500                            F-measure          87.8    98
End-to-End Scene Text Spotting  Total-Text                         Hmean (None)       84      80
Table Recognition               PubTabNet (test)                   TEDS (Simple)      90.5    70
Text Spotting                   CTW1500                            E2E Score (None)   66.8    41
Table Structure Recognition     FinTabNet                          S-TEDS             91.55   40
GUI Navigation                  Multimodal-Mind2Web Cross-Website  Step Success Rate  36.5    37
Table Structure Recognition     PubTabNet                          S-TEDS             91.55   37
Showing 10 of 18 rows
