DMAP: Human-Aligned Structural Document Map for Multimodal Document Understanding

About

Existing multimodal document question-answering (QA) systems predominantly rely on flat semantic retrieval, representing documents as a set of disconnected text chunks and largely neglecting their intrinsic hierarchical and relational structures. Such flattening disrupts logical and spatial dependencies, such as section organization, figure-text correspondence, and cross-reference relations, that humans naturally exploit for comprehension. To address this limitation, we introduce a document-level structural Document MAP (DMAP), which explicitly encodes both hierarchical organization and inter-element relationships within multimodal documents. Specifically, we design a Structured-Semantic Understanding Agent to construct DMAP by organizing textual content together with figures, tables, charts, and other elements into a human-aligned hierarchical schema that captures both semantic and layout dependencies. Building upon this representation, a Reflective Reasoning Agent performs structure-aware and evidence-driven reasoning, dynamically assessing the sufficiency of retrieved context and iteratively refining answers through targeted interactions with DMAP. Extensive experiments on MMDocQA benchmarks demonstrate that DMAP yields document-specific structural representations aligned with human interpretive patterns, substantially enhancing retrieval precision, reasoning consistency, and multimodal comprehension over conventional RAG-based approaches. Code is available at https://github.com/Forlorin/DMAP

ShunLiang Fu, Yanxin Zhang, Yixin Xiang, Xiaoyu Du, Jinhui Tang • 2026
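
To make the idea of a structural document map concrete, the following is a minimal sketch of a tree of document elements with cross-reference links. It is not the authors' implementation; the `Node` class, its fields, and the `index` and `gather_context` helpers are hypothetical names chosen for illustration. It shows how a structure-aware lookup could return a paragraph together with its enclosing sections and the figure it cites, which a flat chunk store would not connect.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of a hypothetical document map (section, paragraph, figure, ...)."""
    node_id: str
    kind: str
    text: str = ""
    children: list[Node] = field(default_factory=list)
    links: list[str] = field(default_factory=list)  # ids of cross-referenced elements

    def add(self, child: Node) -> Node:
        """Attach a child element and return it so calls can be chained."""
        self.children.append(child)
        return child


def index(root: Node) -> dict[str, tuple[Node, list[str]]]:
    """Map each node id to (node, list of ancestor ids from the root down)."""
    out: dict[str, tuple[Node, list[str]]] = {}

    def walk(node: Node, path: list[str]) -> None:
        out[node.node_id] = (node, path)
        for child in node.children:
            walk(child, path + [node.node_id])

    walk(root, [])
    return out


def gather_context(root: Node, node_id: str) -> list[str]:
    """Return ids of the ancestors, the node itself, and its linked elements."""
    idx = index(root)
    node, path = idx[node_id]
    linked = [target for target in node.links if target in idx]
    return path + [node_id] + linked


# Toy document: one section containing a paragraph that cites a figure.
root = Node("doc", "document")
sec = root.add(Node("sec2", "section", "Results"))
sec.add(Node("p1", "paragraph", "See Figure 3.", links=["fig3"]))
sec.add(Node("fig3", "figure", "Accuracy plot"))

context = gather_context(root, "p1")
print(context)
```

In this toy case the lookup for `p1` returns the document root, the enclosing section, the paragraph, and the cited figure, in that order.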

Related benchmarks

Task	Dataset	Result
Multimodal Document Question Answering	LongDocURL	Overall Acc 60.7	21
Multimodal Document Question Answering	MMLongBench	Accuracy 43.2	19
Multimodal Document Question Answering	PaperTab	Accuracy 39	12
Multimodal Document Question Answering	FetaTab	Accuracy 72.5	12
Multimodal Document Question Answering	PaperText	Accuracy 56.7	5
