SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation
About
With the increasing adoption of Large Language Models (LLMs) and Vision-Language Models (VLMs), rich document analysis technologies for applications like Retrieval-Augmented Generation (RAG) and visual RAG are gaining significant attention. Recent research indicates that using VLMs yields better RAG performance, but processing rich documents remains a challenge because a single page contains a large amount of information. In this paper, we present SCAN (SemantiC Document Layout ANalysis), a novel approach that enhances both textual and visual RAG systems that work with visually rich documents. It is a VLM-friendly approach that identifies document components at an appropriate semantic granularity, balancing context preservation with processing efficiency. SCAN takes a coarse-grained semantic approach, dividing each document into coherent regions that cover contiguous components. We trained the SCAN model by fine-tuning object detection models on an annotated dataset. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.4 points and visual RAG performance by up to 10.4 points, outperforming conventional approaches and even commercial document processing solutions.
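To make the idea of coarse-grained regions concrete, the following is a minimal sketch of how the output of a page-level object detector might be turned into an ordered list of regions for a downstream RAG pipeline. It is not the SCAN implementation: the `Region` type, the `select_regions` helper, the label names, the score threshold, and the simple top-to-bottom ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    """One detected region (hypothetical type, not from the paper)."""
    label: str                       # illustrative label, e.g. "text_block"
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
    score: float                     # detector confidence in [0, 1]


def select_regions(
    detections: List[Region],
    page_size: Tuple[int, int],
    min_score: float = 0.5,          # assumed threshold, not from the paper
) -> List[Region]:
    """Drop low-confidence boxes, clip the rest to the page, and sort them
    top-to-bottom, then left-to-right, as a simple reading-order heuristic."""
    width, height = page_size
    kept = []
    for det in detections:
        if det.score < min_score:
            continue
        left, top, right, bottom = det.box
        # Clip the box so a later crop never reaches outside the page.
        left, right = max(0, left), min(width, right)
        top, bottom = max(0, top), min(height, bottom)
        if right > left and bottom > top:
            kept.append(Region(det.label, (left, top, right, bottom), det.score))
    return sorted(kept, key=lambda r: (r.box[1], r.box[0]))


# Example with made-up detections on a 1000 x 1300 pixel page.
detections = [
    Region("table_block", (40, 600, 980, 1200), 0.91),
    Region("text_block", (40, 50, 980, 560), 0.88),
    Region("noise", (0, 0, 10, 10), 0.20),
]
regions = select_regions(detections, page_size=(1000, 1300))
for region in regions:
    print(region.label, region.box)
```

Each returned box could then be cropped from the page image for a VLM or passed to a text extraction step. How SCAN actually orders and post-processes its regions is not specified in the abstract, so the ordering rule here is only a placeholder.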
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Document Retrieval | OHR-Bench Retrieval | Accuracy (Text) | 75.7 | 14 |
| Document Text Generation | OHR-Bench Generation | Text Score | 48.4 | 14 |
| Textual RAG | OHR-Bench (Overall) | TXT Score | 0.444 | 14 |
| Visual RAG | OHR-Bench (test) | TXT Score | 86 | 5 |
| Visual RAG | BizMMRAG | Score (TXT) | 75 | 5 |
| Visual RAG | Allganize | TXT Score | 84.4 | 5 |
| Textual RAG | BizMMRAG Japanese (test) | TXT Score | 81.7 | 5 |
| Textual RAG | Allganize Japanese (test) | TXT Score | 85.9 | 5 |