ABot-OCR Technical Report

About

We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.

Kaitao Jiang, Ruiyan Gong, Xiaolong Cheng, Kangning Niu, Tianlun Li, Mu Xu • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Document Parsing | OmniDocBench 1.5 (test) | Text Edit Error: 0.034 | 132 |
| Document Parsing | OmniDocBench Full v1.6 | Overall Accuracy: 93.3 | 44 |
| Document Parsing | Multilingual Document Parsing Dataset | Performance (Arabic): 1.8 | 4 |
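
The "Text Edit Error" result above is not defined on this page. One common way to score OCR output is a normalized Levenshtein distance between the predicted and reference text, where lower is better. The sketch below illustrates that idea only; the function names and the choice to normalize by the longer string are assumptions, not the benchmark's official scoring code.

```python
def levenshtein(a: str, b: str) -> int:
    """Count single-character insertions, deletions, and substitutions turning a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_edit_error(prediction: str, reference: str) -> float:
    """Hypothetical helper: edit distance divided by the longer string's length."""
    # Assumption: normalizing by the longer string bounds the score to [0, 1].
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(prediction, reference) / longest


if __name__ == "__main__":
    print(normalized_edit_error("# Titel\nHello", "# Title\nHello"))
```

Under this definition, a score of 0.0 means the transcription matches the reference exactly, and a score near 1.0 means almost every character differs.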