
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

About

Current document parsing methods advance primarily through model architecture innovation, while systematic engineering of training data remains underexplored. Yet state-of-the-art models spanning diverse architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than from architectural differences. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art purely through data engineering and training strategy design while retaining the 1.2B-parameter architecture of MinerU2.5 unchanged. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while mitigating distribution shift; Cross-Model Consistency Verification leverages output consensus among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy--large-scale pre-training, hard sample fine-tuning, and GRPO alignment--sequentially exploits these data at different quality tiers. On the evaluation front, we rectify element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods, including those based on models with over 200x more parameters.
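The Cross-Model Consistency Verification idea described above can be illustrated with a small sketch. This is not the paper's implementation; it assumes a simple consensus rule: compute pairwise string similarity among outputs from heterogeneous models, auto-label samples with high consensus, and route low-consensus samples to the hard-sample pool. The `triage` function, the 0.95 threshold, and the use of `difflib` similarity are all illustrative choices.

```python
from difflib import SequenceMatcher


def pairwise_agreement(outputs):
    """Mean pairwise similarity among model outputs (1.0 = full consensus)."""
    n = len(outputs)
    sims = [SequenceMatcher(None, outputs[i], outputs[j]).ratio()
            for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)


def triage(outputs, threshold=0.95):
    """Route a sample: high consensus -> pseudo-label, low -> hard-sample pool."""
    score = pairwise_agreement(outputs)
    if score >= threshold:
        # Take the output most similar to all others as the annotation.
        best = max(outputs, key=lambda o: sum(
            SequenceMatcher(None, o, other).ratio() for other in outputs))
        return "easy", best
    return "hard", None
```

Under this scheme, near-identical outputs yield a pseudo-annotation for free, while disagreement flags a sample for the Judge-and-Refine pipeline.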

Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao, Xiaomeng Zhao, Huaping Zhong, Liqun Wei, Jing Yu, Jie Yang, Wei Li, Shasha Wang, Qianqian Wu, Xuanhe Zhou, Weijia Li, Zhenxiang Li, Zhongying Tu, Jiang Wu, Lijun Wu, Chao Xu, Kai Chen, Wentao Zhang, Yu Qiao, Bowen Zhou, Dahua Lin, Conghui He (2026)

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Document Parsing | OmniDocBench Full v1.6 | Overall Accuracy | 95.69 | 21 |
| Formula Recognition | OmniDoc Base | CDM | 99.2 | 9 |
| Formula Recognition | OmniDoc Hard | CDM | 98.79 | 9 |
| Formula Recognition | UniMERNet CPE | CDM | 98.97 | 9 |
| Formula Recognition | UniMERNet SPE | CDM | 99.44 | 9 |
| Formula Recognition | LaTeX-80M | CDM | 97.23 | 9 |
| Text Recognition | OmniDocBench Base v1.6 | Edit Distance | 1.5 | 9 |
| Text Recognition | OmniDocBench v1.6 (Hard) | Edit Distance | 4.8 | 9 |
| Text Recognition | OmniDocBench v1.6 (Full) | Edit Distance | 1.9 | 9 |
| Formula Recognition | UniMERNet HWE | CDM | 95.38 | 9 |

Showing 10 of 13 rows.
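The text-recognition rows above report edit distance. As a reference for how such a score is typically computed, here is a minimal Levenshtein implementation with length normalization; the exact normalization and scaling used by OmniDocBench may differ (the benchmark values appear to be scaled, e.g. ×100), so treat this as a generic sketch rather than the official metric code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings via dynamic programming."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # distances for the empty-reference prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[n]


def normalized_edit_distance(ref, hyp):
    """Edit distance divided by reference length (a common convention)."""
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

Lower is better: a score of 0 means the predicted text matches the ground truth exactly.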
