M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
About
In long, multi-page industrial documents, retrieval-augmented generation (RAG) depends heavily on whether chunk boundaries follow the document's true structure. Existing text-centric chunkers and generative hierarchy parsers often miss cross-page parent-child relations, figure/table-caption bindings, and boundary cues, which leads to fragmented or redundant chunks and degrades both retrieval and answer quality. We propose M3DocDep, an LVLM-based pipeline that first recovers block-level dependencies and then constructs chunks along the recovered document tree. The pipeline uses SharedDet as a common DP+OCR preprocessing layer, extracts multimodal block embeddings with boundary-aware SoftROI pooling, scores candidate parent-child edges with a biaffine head, decodes a globally valid dependency tree with MST constraints, and builds tree-guided chunks annotated with section paths and page ranges. Under a shared-block evaluation protocol, M3DocDep improves STEDS by +28.5 to +39.6 percent on DHP benchmarks, retrieval nDCG by +1.1 to +15.3 percent, and QA ANLS by +4.5 to +15.3 percent on corpus-level RAG benchmarks. These results show that recovering document dependencies before chunking yields more coherent retrieval units for long, multi-page multimodal documents.
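The final stage described above, building chunks along a recovered tree and annotating them with section paths and page ranges, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Block` fields, the `build_chunks` function, and the rule "one chunk per heading, containing its non-heading descendants" are all assumptions made for illustration. The parent indices are taken as given, standing in for the output of the dependency decoding step.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Block:
    """One layout block. All field names here are hypothetical."""
    idx: int
    kind: str               # e.g. "heading", "paragraph", "figure", "caption"
    text: str
    page: int
    parent: Optional[int]   # index of the parent block, None for top-level


def build_chunks(blocks: List[Block]) -> List[dict]:
    """Emit one chunk per heading, holding its non-heading descendants.

    Assumed simplification: blocks with no heading ancestor are skipped.
    """
    by_idx = {b.idx: b for b in blocks}
    children: Dict[Optional[int], List[int]] = {}
    for b in blocks:
        children.setdefault(b.parent, []).append(b.idx)

    def section_path(idx: int) -> List[str]:
        # Walk up the tree and keep heading titles, outermost first.
        path = []
        cur: Optional[Block] = by_idx[idx]
        while cur is not None:
            if cur.kind == "heading":
                path.append(cur.text)
            cur = by_idx[cur.parent] if cur.parent is not None else None
        return path[::-1]

    def collect(idx: int) -> List[int]:
        # Preorder walk; sub-headings are left out so they form their own chunk.
        out = [idx]
        for c in children.get(idx, []):
            if by_idx[c].kind != "heading":
                out.extend(collect(c))
        return out

    chunks = []
    for b in blocks:
        if b.kind != "heading":
            continue
        members = collect(b.idx)
        pages = [by_idx[i].page for i in members]
        chunks.append({
            "section_path": section_path(b.idx),
            "page_range": (min(pages), max(pages)),
            "text": "\n".join(by_idx[i].text for i in members),
        })
    return chunks


# Toy document: a subsection whose body and figure continue on the next page.
blocks = [
    Block(0, "heading", "1 Safety", 1, None),
    Block(1, "paragraph", "Wear gloves.", 1, 0),
    Block(2, "heading", "1.1 Valves", 1, 0),
    Block(3, "paragraph", "Close valve A first.", 2, 2),
    Block(4, "figure", "[Figure 3]", 2, 2),
    Block(5, "caption", "Figure 3: Valve layout.", 2, 4),
]
chunks = build_chunks(blocks)
for c in chunks:
    print(c["section_path"], c["page_range"])
```

In this toy input, the caption stays in the same chunk as its figure because it is attached as the figure's child, and the subsection chunk spans both pages because its children cross the page break. A purely text-based splitter that cuts at page boundaries would separate them.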
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Document Retrieval | DUDE | -- | 32 |
| Question Answering | DUDE | ANLS 21.43 | 13 |
| Question Answering | MOAMOB | ANLS 27.14 | 13 |
| Retrieval | CUAD | Recall 91.25 | 13 |
| Retrieval | MOAMOB | Recall 76.97 | 13 |
| Question Answering | CUAD | ANLS 29.25 | 13 |
| Hierarchy Recovery | HRDS | F1 Score 82.87 | 10 |
| Hierarchy Recovery | HRDH | F1 Score (HRDH) 77.75 | 10 |
| Hierarchy Recovery | DocHieNet | F1 Score 76.01 | 10 |
| Question Answering | MP-DocVQA | ANLS 18.17 | 7 |