
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

About

Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively; despite their decent performance, they face integration overhead, efficiency bottlenecks, and layout structure degradation. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
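The two-stage pipeline described above can be sketched in a few lines: stage one produces layout anchors in reading order, and stage two parses each anchor's content in parallel under an element-type-specific prompt. The sketch below is illustrative only; the function names, the `Anchor` class, and the prompt strings are assumptions, not the API of the Dolphin repository.

```python
# Hypothetical sketch of an analyze-then-parse pipeline in the style of Dolphin.
# analyze_layout / parse_element are stand-ins, NOT the repo's real functions.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Anchor:
    kind: str    # e.g. "paragraph", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) region on the page, in reading order

# Task-specific prompts keyed by element type (illustrative wording).
PROMPTS = {
    "paragraph": "Read the text in this region.",
    "table": "Parse this table into structured form.",
    "formula": "Transcribe this formula.",
}

def analyze_layout(page_image):
    """Stage 1 (stand-in): emit layout anchors in reading order."""
    return [Anchor("paragraph", (0, 0, 100, 20)),
            Anchor("table", (0, 25, 100, 60))]

def parse_element(page_image, anchor):
    """Stage 2 (stand-in): decode one anchor's content under its prompt."""
    prompt = PROMPTS.get(anchor.kind, "Parse this region.")
    return f"[{anchor.kind} parsed with prompt: {prompt!r}]"

def parse_document(page_image):
    anchors = analyze_layout(page_image)      # sequential layout analysis
    with ThreadPoolExecutor() as pool:        # anchors are parsed in parallel
        parts = pool.map(lambda a: parse_element(page_image, a), anchors)
    return "\n".join(parts)
```

The key design point mirrored here is that only the cheap layout pass is sequential; the expensive per-element decoding is independent across anchors, which is what allows parallel parsing.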

Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, Hao Liu, Can Huang • 2025

Related benchmarks

Task | Dataset | Result | Rank
Visual Question Answering | TextVQA | Accuracy: 49.4 | 1117
Visual Question Answering | ChartQA | -- | 239
Visual Question Answering | AI2D | Accuracy: 63.3 | 174
Document Visual Question Answering | DocVQA | ANLS: 83.3 | 164
Document Parsing | OmniDocBench v1.5 | Overall Score: 83.21 | 126
Optical Character Recognition | OCRBench | OCRBench Score: 50.9 | 83
Infographic Question Answering | InfoVQA | ANLS: 42 | 54
Web agent tasks | Mind2Web Cross-Task | Element Accuracy: 24 | 49
Web agent tasks | Mind2Web Cross-Website | Element Accuracy: 20.9 | 40
Web agent tasks | Mind2Web Cross-Domain | Element Accuracy: 19.3 | 37
(10 of 32 benchmark rows shown.)
