SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding
About
Multimodal large language models (MLLMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents. Traditional methods that use document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to MLLMs is inefficient, especially for lengthy documents. In this work, we present a novel framework named **S**elf-**V**isual **R**etrieval-**A**ugmented **G**eneration (SV-RAG), which can broaden the horizons of any MLLM to support long-document understanding. We demonstrate that **MLLMs themselves can be an effective multimodal retriever**, fetching relevant pages and then answering user questions based on those pages. SV-RAG is implemented with two MLLM adapters, one for evidence-page retrieval and the other for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of SV-RAG.
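The two-stage flow described above (retrieve evidence pages, then answer from them) can be sketched as follows. This is a minimal illustration, not the actual SV-RAG implementation: `score_page` stands in for the retrieval adapter (here a toy word-overlap score instead of MLLM embeddings), and `answer` stands in for the QA adapter; all function names and the sample pages are hypothetical.

```python
def score_page(question: str, page_text: str) -> float:
    """Toy relevance score (word overlap) standing in for the retrieval adapter,
    which in SV-RAG would be an MLLM scoring rendered page images."""
    q_words = set(question.lower().split())
    p_words = set(page_text.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)

def retrieve_top_k(question: str, pages: list[str], k: int = 2) -> list[int]:
    """Rank all pages by relevance and return the indices of the top-k,
    sorted back into reading order for the QA stage."""
    ranked = sorted(range(len(pages)),
                    key=lambda i: score_page(question, pages[i]),
                    reverse=True)
    return sorted(ranked[:k])

def answer(question: str, pages: list[str], evidence: list[int]) -> str:
    """Stand-in for the QA adapter: in SV-RAG an MLLM would answer
    conditioned only on the retrieved evidence pages."""
    context = " ".join(pages[i] for i in evidence)
    return f"[answer conditioned on pages {evidence}: {context!r}]"

# Hypothetical 3-page "document"; only pages 0 and 2 mention revenue.
pages = [
    "revenue grew 12% in 2023",
    "the board met in march",
    "revenue growth forecast for 2024",
]
evidence = retrieve_top_k("what was revenue growth in 2023", pages, k=2)
response = answer("what was revenue growth in 2023", pages, evidence)
```

Only the retrieved pages reach the QA stage, which is what keeps inference cost flat as document length grows.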
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-context document understanding | MMLongBench-Doc | Accuracy | 23 | 58 |
| Document Visual Question Answering | SlideVQA | F1 Score | 0.343 | 32 |
| Multi-page Document Question Answering | MP-DocVQA | ANLS | 71 | 27 |
| Hierarchy Question Answering | MMDA | AIC Accuracy | 35.6 | 22 |
| Multi-page Document Understanding | DUDE | ANLS | 45 | 21 |
| Document Understanding | MPDocVQA | ANLS | 71 | 15 |
| Location Question Answering | MMDA | AIC Accuracy | 32.45 | 11 |
| Text Question Answering | MMDA | AIC Accuracy | 43.92 | 11 |