PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
About
In this report, we propose PaddleOCR-VL, a state-of-the-art (SOTA), resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. The model supports 109 languages and excels at recognizing complex elements (e.g., text, tables, formulas, and charts) while keeping resource consumption minimal. In comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, is strongly competitive with top-tier VLMs, and delivers fast inference speeds. These strengths make it well suited to practical deployment in real-world scenarios. Code is available at https://github.com/PaddlePaddle/PaddleOCR.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Document Parsing | OmniDocBench v1.5 | Overall Score | 92.86 | 126 |
| Document Parsing | olmOCR-bench | ArXiv Processing Accuracy | 85.7 | 36 |
| Reading Order Detection | OmniDocBench ZH v1.0 | Edit Distance | 0.063 | 28 |
| Reading Order Detection | OmniDocBench EN v1.0 | Edit Distance | 0.045 | 28 |
| Document Parsing | OmniDocBench 1.5 (test) | Overall Score | 92.86 | 27 |
| Reading Order Detection | OmniDocBench v1.5 | Edit Distance | 0.043 | 21 |
| Document Parsing | Real5-OmniDocBench scanning scenario 1.5 (test) | Overall Score | 92.11 | 19 |
| Document Parsing | OmniDocBench Real5 illumination | Overall Score | 0.8961 | 19 |
| Document Parsing | OmniDocBench Real5 warping | Overall Score | 85.97 | 19 |
| Document Parsing | Real5-OmniDocBench 5-distortion types (test) | Overall Accuracy | 85.54 | 19 |
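
The reading-order rows above report an edit distance, where lower is better. As a rough illustration of how such a score can be computed, the sketch below takes the Levenshtein distance between a predicted and a reference block ordering and divides it by the longer sequence length. This normalization, the function names, and the sample orderings are assumptions made for illustration; the benchmark's own scoring script may differ.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance scaled to [0, 1] by the longer sequence length.

    Assumption: this normalization is illustrative only and is not taken
    from the benchmark's evaluation code.
    """
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 0.0
    return levenshtein(pred, ref) / longest


# Hypothetical page with five layout blocks; the prediction swaps blocks 1 and 2.
reference_order = [0, 1, 2, 3, 4]
predicted_order = [0, 2, 1, 3, 4]
score = normalized_edit_distance(predicted_order, reference_order)
print(f"normalized edit distance: {score:.3f}")
```

A perfect ordering scores 0.0 under this definition, so values such as 0.043 indicate that only a small fraction of blocks are out of place.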