Lightweight and Production-Ready PDF Visual Element Parsing

About

PDF documents contain critical visual elements such as figures, tables, and forms whose accurate extraction is essential for document understanding and multimodal retrieval-augmented generation (RAG). Existing PDF parsers often miss complex visuals, extract non-informative artifacts (e.g., watermarks, logos), produce fragmented elements, and fail to reliably associate captions with their corresponding elements, which degrades downstream retrieval and question answering. We present a lightweight, production-ready PDF parsing framework that accurately detects visual elements and associates captions with them using a combination of spatial heuristics, layout analysis, and semantic similarity. On popular benchmark datasets and internal product data, the proposed solution achieves $\geq96\%$ visual element detection accuracy and $93\%$ caption association accuracy. When used as a preprocessing step for multimodal RAG, it significantly outperforms state-of-the-art parsers and large vision-language models on both internal data and the MMDocRAG benchmark, while reducing latency by over $2\times$. We have deployed the proposed system in a challenging production environment.
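The abstract does not give the scoring details, so the following is only an illustrative sketch of how spatial heuristics and a text-similarity signal might be combined for caption association. Everything in it is an assumption rather than the authors' method: the `Box` type, the weights and thresholds, the top-left coordinate origin (y grows downward), and the use of token-overlap (Jaccard) similarity as a simple stand-in for semantic similarity.

```python
import re
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box; y is assumed to grow downward."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def vertical_gap(element: Box, caption: Box) -> float:
    """Distance between the vertical extents; 0 if they overlap."""
    if caption.y0 >= element.y1:
        return caption.y0 - element.y1
    if element.y0 >= caption.y1:
        return element.y0 - caption.y1
    return 0.0


def horizontal_overlap(element: Box, caption: Box) -> float:
    """Shared horizontal extent as a fraction of the narrower box."""
    overlap = min(element.x1, caption.x1) - max(element.x0, caption.x0)
    width = min(element.x1 - element.x0, caption.x1 - caption.x0)
    return max(0.0, overlap) / width if width > 0 else 0.0


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a real semantic model."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def score(element: Box, caption: Box, max_gap: float = 50.0,
          w_spatial: float = 0.7, w_semantic: float = 0.3) -> float:
    """Weighted sum of spatial and text signals (weights are assumed)."""
    gap = vertical_gap(element, caption)
    if gap > max_gap:
        return 0.0  # too far apart to be a plausible caption
    spatial = (1.0 - gap / max_gap) * horizontal_overlap(element, caption)
    return w_spatial * spatial + w_semantic * jaccard(element.text, caption.text)


def associate(elements: list[Box], captions: list[Box],
              min_score: float = 0.2) -> dict[int, int]:
    """Greedy one-to-one matching: element index -> caption index."""
    pairs = sorted(
        ((score(e, c), i, j)
         for i, e in enumerate(elements)
         for j, c in enumerate(captions)),
        reverse=True,
    )
    matched: dict[int, int] = {}
    used_captions: set[int] = set()
    for s, i, j in pairs:
        if s < min_score:
            break  # remaining pairs score even lower
        if i in matched or j in used_captions:
            continue
        matched[i] = j
        used_captions.add(j)
    return matched
```

Calling `associate(elements, captions)` returns a mapping from element index to caption index; elements with no caption scoring above `min_score` are simply left out. A production system would likely replace the Jaccard term with an embedding-based similarity and take layout cues (columns, reading order) into account.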

Meizhu Liu, Yassi Abbasi, Matthew Rowe, Michael Avendi, Paul Li • 2026

Related benchmarks

Task	Dataset	Result
PDF Parsing	Internal PDF Parsing Dataset	Text Extraction Accuracy 99	6
Document Layout Detection	PDF Parsing Evaluation Set	Table BBA 96	4
Image Captioning	PDF Parsing Evaluation Set	Caption Similarity 0.93	4
Text Extraction	PDF Parsing Evaluation Set	Text Accuracy 99	4

