General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

About

Traditional OCR systems (OCR-1.0) increasingly fall short of users' needs as demand grows for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain text, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with a model, GOT, to promote the arrival of OCR-2.0. GOT, with 580M parameters, is a unified end-to-end model consisting of a high-compression encoder and a long-context decoder. As an OCR-2.0 model, GOT can handle all of the above "characters" across various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in both sliced and whole-page form. On the output side, GOT can generate plain or formatted results (Markdown/TikZ/SMILES/kern) via a simple prompt. The model also supports interactive OCR, i.e., region-level recognition guided by coordinates or colors. Furthermore, we adapt dynamic-resolution and multi-page OCR techniques to GOT for better practicality. In experiments, we provide results that demonstrate the superiority of our model.

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Document Parsing | olmOCR-bench | ArXiv Processing Accuracy: 52.7 | 36 |
| Multimodal Optical Character Recognition | OCRBench | Recognition Score: 245 | 34 |
| Text Structural Anomaly Perception | Chinese recognition | Precision: 50 | 19 |
| Canonical Text Recognition | English recognition | R: 61 | 19 |
| Canonical Text Recognition | Chinese recognition | R: 85.3 | 19 |
| Text Structural Anomaly Perception | English recognition | Precision: 0.00e+0 | 19 |
| Document Retrieval | OHR-Bench Retrieval | Accuracy (Text): 62.1 | 14 |
| Document Text Generation | OHR-Bench Generation | Text Score: 37.5 | 14 |
| Textual RAG | OHR-Bench (Overall) | TXT Score: 0.353 | 14 |
| Document Parsing | medical invoice (test) | FMR: 0.4932 | 10 |
