MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm

About

We introduce MonkeyOCR, a document parsing model that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline and avoids the inefficiencies of processing full pages with giant end-to-end models. In SRR, document parsing is abstracted into three fundamental questions: "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation), corresponding to structure detection, content recognition, and relation prediction. To support this paradigm, we present MonkeyDoc, a comprehensive dataset with 4.5 million bilingual instances spanning over ten document types, which addresses the limitations of existing datasets that often focus on a single task, language, or document type. Leveraging the SRR paradigm and MonkeyDoc, we trained a 3B-parameter document foundation model. We further identify parameter redundancy in this model and propose contiguous parameter degradation (CPD), enabling the construction of models from 0.6B to 1.2B parameters that run faster with an acceptable performance drop. MonkeyOCR achieves state-of-the-art performance, surpassing previous open-source and closed-source methods, including Gemini 2.5-Pro. Additionally, the model can be efficiently deployed for inference on a single RTX 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.

Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Biao Yang, Zidun Guo, Jiarui Zhang, Xinyu Wang, Xiang Bai • 2025
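
The three SRR questions in the abstract can be pictured as three stages applied in sequence. The following is a minimal sketch of that control flow only, not the MonkeyOCR implementation: `Region`, `parse_page`, and the three toy stage functions are hypothetical names introduced for illustration, the "page" is a plain dictionary rather than an image, and the relation stage is replaced by a naive top-to-bottom sort rather than a learned predictor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1), hypothetical convention


@dataclass
class Region:
    bbox: BBox          # "Where is it?"  (structure)
    kind: str           # region type, e.g. "title" or "text"
    content: str = ""   # "What is it?"   (recognition)


def parse_page(
    page,
    detect: Callable[[object], List[Region]],
    recognize: Callable[[object, Region], str],
    order: Callable[[List[Region]], List[int]],
) -> List[Region]:
    """Run structure detection, content recognition, then relation prediction."""
    regions = detect(page)                         # structure
    for region in regions:
        region.content = recognize(page, region)   # recognition
    return [regions[i] for i in order(regions)]    # relation ("How is it organized?")


# --- Toy stand-ins so the sketch runs; real stages would be model calls. ---
ToyPage = Dict[BBox, Tuple[str, str]]  # bbox -> (kind, text)


def toy_detect(page: ToyPage) -> List[Region]:
    return [Region(bbox=b, kind=k) for b, (k, _) in page.items()]


def toy_recognize(page: ToyPage, region: Region) -> str:
    return page[region.bbox][1]


def toy_order(regions: List[Region]) -> List[int]:
    # Naive heuristic (an assumption): sort by top edge, then left edge.
    return sorted(range(len(regions)),
                  key=lambda i: (regions[i].bbox[1], regions[i].bbox[0]))


toy_page: ToyPage = {
    (0, 100, 200, 150): ("text", "Body paragraph."),
    (0, 0, 200, 40): ("title", "A Title"),
}
parsed = parse_page(toy_page, toy_detect, toy_recognize, toy_order)
reading_order = [r.content for r in parsed]
```

Passing the stages in as functions keeps them independent, which reflects the abstract's point that each question is handled separately rather than by one full-page end-to-end pass.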

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Document Parsing | OmniDocBench v1.5 | Overall Score | 88.85 | 126 |
| Document Parsing | olmOCR-bench | ArXiv Processing Accuracy | 83.8 | 36 |
| Reading Order Detection | OmniDocBench ZH v1.0 | Edit Distance | 0.185 | 28 |
| Reading Order Detection | OmniDocBench EN v1.0 | Edit Distance | 0.1 | 28 |
| Document Parsing | OmniDocBench 1.5 (test) | Overall Score | 88.85 | 27 |
| OCR-related Parsing Tasks | OmniDocBench English | Edit Distance | 0.058 | 23 |
| Reading Order Detection | OmniDocBench v1.5 | Edit Distance | 0.128 | 21 |
| Text Structural Anomaly Perception | Chinese recognition | Precision | 100 | 19 |
| Document Parsing | Real5-OmniDocBench scanning scenario 1.5 (test) | Overall Score | 86.94 | 19 |
| Canonical Text Recognition | Chinese recognition | R | 91.7 | 19 |
Showing 10 of 24 rows
