M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models

About

In long, multi-page industrial documents, retrieval-augmented generation (RAG) depends heavily on whether chunk boundaries follow the document's true structure. Existing text-centric chunkers and generative hierarchy parsers often miss cross-page parent-child relations, figure/table-caption bindings, and boundary cues, which leads to fragmented or redundant chunks and degrades both retrieval and answer quality. We propose M3DocDep, an LVLM-based pipeline that first recovers block-level dependencies and then constructs chunks along the recovered document tree. The pipeline uses SharedDet as a common DP+OCR preprocessing layer, extracts multimodal block embeddings with boundary-aware SoftROI pooling, scores candidate parent-child edges with a biaffine head, decodes a globally valid dependency tree with MST constraints, and builds tree-guided chunks annotated with section paths and page ranges. Under a shared-block evaluation protocol, M3DocDep improves STEDS by +28.5 to +39.6 percent on DHP benchmarks, retrieval nDCG by +1.1 to +15.3 percent, and QA ANLS by +4.5 to +15.3 percent on corpus-level RAG benchmarks. These results show that recovering document dependencies before chunking yields more coherent retrieval units for long, multi-page multimodal documents.
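The last three stages named above (edge scoring, tree decoding, tree-guided chunking) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `Block`, `biaffine_score`, `decode_tree`, and `build_chunks` are hypothetical, the score matrix is hand-written rather than learned, and the decoder is a simplification: instead of an MST algorithm, each block picks its best-scoring parent among earlier blocks in reading order, which is enough to guarantee an acyclic tree rooted at a virtual document node.

```python
from dataclasses import dataclass


@dataclass
class Block:
    idx: int                  # reading-order index; 0 is a virtual document root
    text: str
    page: int
    is_heading: bool = False


def biaffine_score(child, parent, U, w, b):
    """Standard biaffine form: child^T U parent + w . [child; parent] + b."""
    bilinear = sum(child[i] * U[i][j] * parent[j]
                   for i in range(len(child)) for j in range(len(parent)))
    linear = sum(wi * xi for wi, xi in zip(w, child + parent))
    return bilinear + linear + b


def decode_tree(scores):
    """scores[c][p] is the score of the edge parent p -> child c.

    Simplified decoder (an assumption, not MST): candidate parents are
    restricted to earlier blocks, so no cycle can form.
    """
    parents = [-1]  # the root has no parent
    for c in range(1, len(scores)):
        parents.append(max(range(c), key=lambda p: scores[c][p]))
    return parents


def section_path(blocks, parents, i):
    """Heading texts from the root down to block i (inclusive)."""
    path = []
    while i > 0:
        if blocks[i].is_heading:
            path.append(blocks[i].text)
        i = parents[i]
    return path[::-1]


def build_chunks(blocks, parents):
    """Group each block under its nearest heading (itself, if it is one)."""
    groups = {}
    for blk in blocks[1:]:
        owner = blk.idx
        while owner > 0 and not blocks[owner].is_heading:
            owner = parents[owner]
        g = groups.setdefault(owner, {"text": [], "pages": []})
        g["text"].append(blk.text)
        g["pages"].append(blk.page)
    return [{"section_path": section_path(blocks, parents, owner),
             "text": g["text"],
             "page_range": (min(g["pages"]), max(g["pages"]))}
            for owner, g in groups.items()]


# Toy document: a subsection that continues onto the next page.
blocks = [
    Block(0, "<root>", 0),
    Block(1, "1 Safety", 1, is_heading=True),
    Block(2, "Wear gloves.", 1),
    Block(3, "1.1 Valves", 1, is_heading=True),
    Block(4, "Close valve A.", 2),
    Block(5, "Table 1: torque limits", 2),
]
# Hand-written edge scores; only entries with p < c are consulted.
scores = [
    [0, 0, 0, 0, 0, 0],
    [5, 0, 0, 0, 0, 0],
    [1, 4, 0, 0, 0, 0],
    [1, 3, 2, 0, 0, 0],
    [0, 1, 0, 4, 0, 0],
    [0, 1, 0, 3, 2, 0],
]
parents = decode_tree(scores)
chunks = build_chunks(blocks, parents)
for chunk in chunks:
    print(chunk["section_path"], chunk["page_range"], chunk["text"])
```

In this toy case the second chunk carries the path "1 Safety" > "1.1 Valves" and spans pages 1 to 2, so the table on page 2 stays attached to the subsection that began on page 1. A decoder closer to what the abstract describes would replace `decode_tree` with a maximum spanning arborescence algorithm such as Chu-Liu/Edmonds.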

Joongmin Shin, Jeongbae Park, Jaehyung Seo, Heuiseok Lim • 2026

Related benchmarks

Task	Dataset	Result
Document Retrieval	DUDE	--	32
Question Answering	DUDE	ANLS 21.43	13
Question Answering	MOAMOB	ANLS 27.14	13
Retrieval	CUAD	Recall 91.25	13
Retrieval	MOAMOB	Recall 76.97	13
Question Answering	CUAD	ANLS 29.25	13
Hierarchy Recovery	HRDS	F1 Score 82.87	10
Hierarchy Recovery	HRDH	F1 Score (HRDH) 77.75	10
Hierarchy Recovery	DocHieNet	F1 Score 76.01	10
Question Answering	MP-DocVQA	ANLS 18.17	7

