DMAP: Human-Aligned Structural Document Map for Multimodal Document Understanding

About

Existing multimodal document question-answering (QA) systems predominantly rely on flat semantic retrieval: they represent a document as a set of disconnected text chunks and largely neglect its intrinsic hierarchical and relational structure. Such flattening disrupts the logical and spatial dependencies (section organization, figure-text correspondence, and cross-reference relations) that humans naturally exploit for comprehension. To address this limitation, we introduce a document-level structural Document MAP (DMAP), which explicitly encodes both the hierarchical organization of a multimodal document and the relationships among its elements. Specifically, we design a Structured-Semantic Understanding Agent that constructs DMAP by organizing textual content together with figures, tables, charts, and other elements into a human-aligned hierarchical schema that captures both semantic and layout dependencies. Building upon this representation, a Reflective Reasoning Agent performs structure-aware, evidence-driven reasoning: it dynamically assesses whether the retrieved context is sufficient and iteratively refines its answers through targeted interactions with DMAP. Extensive experiments on MMDocQA benchmarks demonstrate that DMAP yields document-specific structural representations aligned with human interpretive patterns, substantially improving retrieval precision, reasoning consistency, and multimodal comprehension over conventional RAG-based approaches. Code is available at https://github.com/Forlorin/DMAP
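
As an illustration only, the idea of a hierarchical map with cross-references can be sketched in a few lines of Python. This is not the authors' schema or code: the `Node` class, its fields, and the `index` and `structural_context` helpers are all hypothetical names chosen for this sketch. It shows how a retrieved paragraph might be expanded with its heading path and with the figure it refers to, rather than being returned as an isolated chunk.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """One element of a hypothetical document map (an assumption, not DMAP's schema)."""
    node_id: str
    kind: str   # e.g. "document", "section", "paragraph", "figure", "table"
    text: str
    children: list = field(default_factory=list)
    refs: list = field(default_factory=list)   # ids of cross-referenced elements
    parent: "Node | None" = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach `child` under this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def index(root: Node) -> dict:
    """Map node_id -> Node for every node in the tree rooted at `root`."""
    out = {root.node_id: root}
    for child in root.children:
        out.update(index(child))
    return out


def structural_context(node: Node, by_id: dict):
    """Return the heading path above `node` and the elements it cross-references."""
    path = []
    cur = node.parent
    while cur is not None:
        path.append(cur.text)
        cur = cur.parent
    linked = [by_id[r] for r in node.refs if r in by_id]
    return list(reversed(path)), linked


# Toy document: one section containing a figure and a paragraph that cites it.
doc = Node("doc", "document", "Example Paper")
sec = doc.add(Node("s2", "section", "2 Method"))
fig = sec.add(Node("f1", "figure", "Figure 1: pipeline overview"))
par = sec.add(Node("p1", "paragraph", "As Figure 1 shows, ...", refs=["f1"]))

path, linked = structural_context(par, index(doc))
print(path)                          # heading path from the root down to the section
print([n.node_id for n in linked])   # ids of the cross-referenced elements
```

In a flat chunk store, the paragraph `p1` would be retrieved alone; here the same hit also carries its section path and the linked figure, which is the kind of structural evidence the abstract says a reasoning agent can draw on.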

ShunLiang Fu, Yanxin Zhang, Yixin Xiang, Xiaoyu Du, Jinhui Tang • 2026

Related benchmarks

Task                                     Dataset       Metric        Result   Rank
Multimodal Document Question Answering   LongDocURL    Overall Acc   60.7     21
Multimodal Document Question Answering   MMLongBench   Accuracy      43.2     19
Multimodal Document Question Answering   PaperTab      Accuracy      39       5
Multimodal Document Question Answering   PaperText     Accuracy      56.7     5
Multimodal Document Question Answering   FetaTab       Accuracy      72.5     5
