MinerU: An Open-Source Solution for Precise Document Content Extraction

About

Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.

Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, Bo Zhang, Liqun Wei, Zhihao Sui, Wei Li, Botian Shi, Yu Qiao, Dahua Lin, Conghui He • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Document Parsing | OmniDocBench v1.5 | Overall Score | 90.67 | 126
Document Parsing | olmOCR-bench | ArXiv Processing Accuracy | 76.6 | 36
OCR-related Parsing Tasks | OmniDocBench English | Edit Distance | 0.045 | 23
Document Parsing | OmniDocBench EN v1.0 | Overall Edit Distance | 0.111 | 15
Document Parsing | OmniDocBench ZH v1.0 | Overall Edit | 0.174 | 15
Optical Character Recognition | WuDao rendered text images (test) | ROUGE (R=2) | 0.967 | 9
Mathematical Expression Recognition | CMER-Bench Complex | ROUGE-1 | 69.38 | 8
Scientific document parsing | Uni-Parser Benchmark 1.0 (test) | Overall Accuracy (excl. Mol.) | 86.72 | 8
Mathematical Expression Recognition | CMER-Bench Moderate | ROUGE-1 | 0.5716 | 8
Mathematical Expression Recognition | CMER-Bench Easy | ROUGE-1 | 62.38 | 8
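
Several of the results above are reported as an edit distance, where lower is better. As a rough illustration of what such a score measures, here is a minimal sketch of a character-level Levenshtein distance normalized by the longer string's length. This is an assumption made for illustration only: the benchmarks listed here may tokenize, normalize, or aggregate differently, and the function names `levenshtein` and `normalized_edit_distance` are hypothetical, not taken from MinerU or any benchmark code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string on the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means an exact match.

    Normalizing by the longer string is an assumed convention here.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(prediction, reference) / longest
```

Under this convention, a score such as 0.045 would mean that roughly 4.5% of the characters differ between the extracted text and the ground truth.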