ABot-OCR Technical Report

About

We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass, eliminating the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine that provides large-scale, structurally consistent supervision. We further propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and enforces markup well-formedness beyond what supervised fine-tuning alone achieves. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR scores 92.81 and 93.30, respectively, the state of the art among end-to-end systems, substantially narrowing the gap to strong pipeline baselines. Multilingual text recognition experiments across ten languages further indicate that ABot-OCR generalizes well.

Kaitao Jiang, Ruiyan Gong, Xiaolong Cheng, Kangning Niu, Tianlun Li, Mu Xu • 2026

Related benchmarks

Task	Dataset	Result
Document Parsing	OmniDocBench 1.5 (test)	Text Edit Error: 0.034	132
Document Parsing	OmniDocBench Full v1.6	Overall Accuracy: 93.3	44
Document Parsing	OmniDocBench v1.6	Overall Score: 93.3	26
Document Parsing	Multilingual Document Parsing Dataset	Performance (Arabic): 1.8	4
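
The Text Edit Error figure above is presumably a normalized edit distance between predicted and reference text, where lower is better; this page does not state the benchmark's exact normalization. As a minimal sketch under that assumption, the following computes a Levenshtein distance and divides it by the length of the longer string. The function names `levenshtein` and `normalized_edit_error` are illustrative and are not taken from the ABot-OCR or OmniDocBench code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_error(pred: str, ref: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(pred, ref) / longest
```

Under this normalization the score lies in [0, 1], so a value such as 0.034 would mean roughly 3.4 character edits per 100 characters of the longer string.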
