
M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models

About

In long, multi-page industrial documents, retrieval-augmented generation (RAG) depends heavily on whether chunk boundaries follow the document's true structure. Existing text-centric chunkers and generative hierarchy parsers often miss cross-page parent-child relations, figure/table-caption bindings, and boundary cues, which leads to fragmented or redundant chunks and degrades both retrieval and answer quality. We propose M3DocDep, an LVLM-based pipeline that first recovers block-level dependencies and then constructs chunks along the recovered document tree. The pipeline uses SharedDet as a common DP+OCR preprocessing layer, extracts multimodal block embeddings with boundary-aware SoftROI pooling, scores candidate parent-child edges with a biaffine head, decodes a globally valid dependency tree with MST constraints, and builds tree-guided chunks annotated with section paths and page ranges. Under a shared-block evaluation protocol, M3DocDep improves STEDS by +28.5 to +39.6 percent on DHP benchmarks, retrieval nDCG by +1.1 to +15.3 percent, and QA ANLS by +4.5 to +15.3 percent on corpus-level RAG benchmarks. These results show that recovering document dependencies before chunking yields more coherent retrieval units for long, multi-page multimodal documents.
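The final step of the pipeline, building chunks along the recovered tree and annotating them with section paths and page ranges, can be illustrated with a small sketch. This is not the authors' implementation: the `Block` and `Chunk` types, the `"heading"`/`"body"` labels, and the rule of grouping body blocks under their nearest parent are all assumptions made for illustration, and the parent indices are taken as already decoded by some upstream dependency parser.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Block:
    """Hypothetical layout block; `parent` is the index of its parent block."""
    text: str
    page: int
    kind: str                 # assumed labels: "heading" or "body"
    parent: Optional[int]     # None for a top-level block


@dataclass
class Chunk:
    section_path: List[str]
    page_range: Tuple[int, int]
    text: str


def section_path(blocks: List[Block], i: int) -> List[str]:
    """Collect heading texts from the root down to block i's parent."""
    path = []
    p = blocks[i].parent
    while p is not None:
        if blocks[p].kind == "heading":
            path.append(blocks[p].text)
        p = blocks[p].parent
    return path[::-1]


def build_chunks(blocks: List[Block]) -> List[Chunk]:
    """Group body blocks that share a parent into one chunk (assumed rule)."""
    groups: Dict[Optional[int], List[int]] = {}
    for i, b in enumerate(blocks):
        if b.kind != "heading":
            groups.setdefault(b.parent, []).append(i)

    chunks = []
    for members in groups.values():
        pages = [blocks[m].page for m in members]
        chunks.append(Chunk(
            section_path=section_path(blocks, members[0]),
            page_range=(min(pages), max(pages)),
            text="\n".join(blocks[m].text for m in members),
        ))
    return chunks


# Toy document: a table on page 2 still depends on a heading from page 1.
blocks = [
    Block("1 Safety", 1, "heading", None),
    Block("1.1 Valves", 1, "heading", 0),
    Block("Close valve A first.", 1, "body", 1),
    Block("Table 3: torque limits", 2, "body", 1),
    Block("General notes.", 1, "body", 0),
]
chunks = build_chunks(blocks)
for c in chunks:
    print(c.section_path, c.page_range)
```

In this toy input the cross-page table stays in the same chunk as the paragraph it belongs to, so the chunk's page range spans both pages. A purely page-local or fixed-window chunker would be more likely to separate the two.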

Joongmin Shin, Jeongbae Park, Jaehyung Seo, Heuiseok Lim • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Document Retrieval | DUDE | -- | 32 |
| Question Answering | DUDE | ANLS 21.43 | 13 |
| Question Answering | MOAMOB | ANLS 27.14 | 13 |
| Retrieval | CUAD | Recall 91.25 | 13 |
| Retrieval | MOAMOB | Recall 76.97 | 13 |
| Question Answering | CUAD | ANLS 29.25 | 13 |
| Hierarchy Recovery | HRDS | F1 Score 82.87 | 10 |
| Hierarchy Recovery | HRDH | F1 Score (HRDH) 77.75 | 10 |
| Hierarchy Recovery | DocHieNet | F1 Score 76.01 | 10 |
| Question Answering | MP-DocVQA | ANLS 18.17 | 7 |
