SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding

About

Multimodal large language models (MLLMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents. Traditional methods that use document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to an MLLM is inefficient, especially for lengthy documents. In this work, we present a novel framework named **S**elf-**V**isual **R**etrieval-**A**ugmented **G**eneration (SV-RAG), which can broaden the horizons of any MLLM to support long-document understanding. We demonstrate that **MLLMs themselves can serve as effective multimodal retrievers**, fetching relevant pages and then answering user questions based on those pages. SV-RAG is implemented with two specialized MLLM adapters, one for evidence-page retrieval and the other for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of SV-RAG.
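The two-stage pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration: the `embed`, `retrieve_pages`, and `answer` functions below are toy stand-ins (word-count embeddings and text concatenation), whereas in SV-RAG both stages are performed by one MLLM equipped with two LoRA adapters, one producing retrieval embeddings for page images and the other generating answers from the retrieved pages.

```python
import math

def embed(text):
    # Hypothetical embedding: a bag-of-words count vector, standing in
    # for the hidden-state embedding produced by the retrieval adapter.
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_pages(question, pages, top_k=2):
    # Stage 1 (retrieval adapter): score every page against the
    # question and keep the top-k evidence pages.
    q = embed(question)
    ranked = sorted(pages, key=lambda p: cosine(q, embed(p["text"])),
                    reverse=True)
    return ranked[:top_k]

def answer(question, evidence_pages):
    # Stage 2 (QA adapter): a real system would feed the evidence page
    # images plus the question to the MLLM; here we just return the
    # concatenated evidence text as a placeholder.
    return " ".join(p["text"] for p in evidence_pages)

pages = [
    {"id": 1, "text": "revenue grew 12 percent in 2023"},
    {"id": 2, "text": "board members and governance policies"},
    {"id": 3, "text": "appendix with legal disclaimers"},
]
evidence = retrieve_pages("how much did revenue grow", pages, top_k=1)
print([p["id"] for p in evidence])  # → [1]
```

The key design point this sketch preserves is that no OCR-based document parser is involved: the same model family handles both page scoring and answer generation, so only the top-k pages ever reach the expensive generation stage.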

Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan A. Rossi, Changyou Chen, Tong Sun • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-context document understanding | MMLongBench-Doc | Accuracy | 23 | 58 |
| Document Visual Question Answering | SlideVQA | F1 Score | 0.343 | 32 |
| Multi-page Document Question Answering | MP-DocVQA | ANLS | 71 | 27 |
| Hierarchy Question Answering | MMDA | AIC Accuracy | 35.6 | 22 |
| Multi-page Document Understanding | DUDE | ANLS | 45 | 21 |
| Document Understanding | MPDocVQA | ANLS | 71 | 15 |
| Location Question Answering | MMDA | AIC Accuracy | 32.45 | 11 |
| Text Question Answering | MMDA | AIC Accuracy | 43.92 | 11 |
