MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
About
Current document parsing methods advance primarily through model architecture innovation, while systematic engineering of training data remains underexplored. Yet state-of-the-art models spanning diverse architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than from architectural differences. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art purely through data engineering and training strategy design while retaining the 1.2B-parameter architecture of MinerU2.5 unchanged. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands the training data from under 10M to 65.5M samples while mitigating distribution shift; Cross-Model Consistency Verification leverages output consensus among heterogeneous models to assess sample difficulty and generate reliable annotations; and the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy (large-scale pre-training, hard-sample fine-tuning, and GRPO alignment) sequentially exploits these data tiers. On the evaluation front, we rectify element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods, including those based on models with over 200x more parameters.
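The Cross-Model Consistency Verification idea, using agreement among heterogeneous models to decide whether a sample is easy (consensus output kept as a reliable annotation) or hard (routed to further refinement), can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes model outputs are plain strings and uses `difflib` similarity as a stand-in agreement metric, with a hypothetical `threshold` parameter.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_agreement(a: str, b: str) -> float:
    """Similarity between two model outputs (0 = disjoint, 1 = identical)."""
    return SequenceMatcher(None, a, b).ratio()


def consistency_verdict(outputs: list[str], threshold: float = 0.9):
    """Classify a sample by output consensus among heterogeneous models.

    If every pair of outputs agrees above `threshold`, the sample is treated
    as easy and the first output is kept as a pseudo-label; otherwise it is
    flagged as hard and would be routed to a judge-and-refine stage.
    """
    scores = [pairwise_agreement(a, b) for a, b in combinations(outputs, 2)]
    consensus = min(scores) if scores else 1.0
    if consensus >= threshold:
        return "easy", outputs[0]  # consensus output used as annotation
    return "hard", None            # disagreement: needs manual/iterative refinement


# Example: identical predictions yield an easy verdict; divergent ones a hard verdict.
print(consistency_verdict(["E = mc^2", "E = mc^2", "E = mc^2"]))
print(consistency_verdict(["E = mc^2", "E = rnc2", "e = MC^2"]))
```

Using the minimum pairwise score (rather than the mean) is a deliberately conservative choice here: a single dissenting model is enough to mark a sample as hard.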
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Document Parsing | OmniDocBench v1.6 (Full) | Overall Accuracy | 95.69 | 21 |
| Formula Recognition | OmniDoc Base | CDM | 99.2 | 9 |
| Formula Recognition | OmniDoc Hard | CDM | 98.79 | 9 |
| Formula Recognition | UniMERNet CPE | CDM | 98.97 | 9 |
| Formula Recognition | UniMERNet SPE | CDM | 99.44 | 9 |
| Formula Recognition | LaTeX-80M | CDM | 97.23 | 9 |
| Text Recognition | OmniDocBench v1.6 (Base) | Edit Distance | 1.5 | 9 |
| Text Recognition | OmniDocBench v1.6 (Hard) | Edit Distance | 4.8 | 9 |
| Text Recognition | OmniDocBench v1.6 (Full) | Edit Distance | 1.9 | 9 |
| Formula Recognition | UniMERNet HWE | CDM | 95.38 | 9 |