MinerU: An Open-Source Solution for Precise Document Content Extraction
About
Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Document Parsing | OmniDocBench v1.5 | Overall Score | 90.67 | 126 |
| Document Parsing | olmOCR-bench | ArXiv Processing Accuracy | 76.6 | 36 |
| OCR-related Parsing Tasks | OmniDocBench English | Edit Distance | 0.045 | 23 |
| Document Parsing | OmniDocBench EN v1.0 | Overall Edit Distance | 0.111 | 15 |
| Document Parsing | OmniDocBench ZH v1.0 | Overall Edit | 0.174 | 15 |
| Optical Character Recognition | WuDao rendered text images (test) | ROUGE (R=2) | 0.967 | 9 |
| Mathematical Expression Recognition | CMER-Bench Complex | ROUGE-1 | 69.38 | 8 |
| Scientific document parsing | Uni-Parser Benchmark 1.0 (test) | Overall Accuracy (excl. Mol.) | 86.72 | 8 |
| Mathematical Expression Recognition | CMER-Bench Moderate | ROUGE-1 | 0.5716 | 8 |
| Mathematical Expression Recognition | CMER-Bench Easy | ROUGE-1 | 62.38 | 8 |