General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

About

Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's needs due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. GOT, with 580M parameters, is a unified, elegant, end-to-end model consisting of a high-compression encoder and a long-context decoder. As an OCR-2.0 model, GOT can handle all the above "characters" under various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via a simple prompt. In addition, the model offers interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we adapt dynamic resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to demonstrate the superiority of our model.

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang • 2024

Related benchmarks

Task	Dataset	Result
Multimodal Optical Character Recognition	OCRBench	Recognition Score: 245	66
Document Parsing	olmOCR-bench	ArXiv Processing Accuracy: 52.7	59
Table Extraction	100 pages (451 tables) synthetic (test)	LLM Score (Overall): 5.13	21
Table Structure Recognition	PubTables-1M	--	20
Text Structural Anomaly Perception	Chinese recognition	Precision: 50	19
Canonical Text Recognition	English recognition	R: 61	19
Canonical Text Recognition	Chinese recognition	R: 85.3	19
Text Structural Anomaly Perception	English recognition	Precision: 0.00e+0	19
Document Retrieval	OHR-Bench Retrieval	Accuracy (Text): 62.1	14
Document Text Generation	OHR-Bench Generation	Text Score: 37.5	14

Showing 10 of 18 rows
