MARA: A Multimodal Adaptive Retrieval-Augmented Framework for Document Question Answering

About

Retrieval-based multimodal document QA aims to identify and integrate relevant information from visually rich documents with complex multimodal structures. While retrieval-augmented generation (RAG) has shown strong performance in text-based QA, its extensions to multimodal documents remain underexplored and face significant limitations. Specifically, current approaches rely on query-agnostic document representations that overlook salient content, and they use static top-k evidence selection, which fails to adapt to the uncertain distribution of relevant information. To address these limitations, we propose the Multimodal Adaptive Retrieval-Augmented (MARA) framework, which introduces query-adaptive mechanisms into both retrieval and generation. MARA consists of two components: a Query-Aligned Region Encoder that builds multi-level document representations and reweights them based on query relevance to improve retrieval precision; and a Self-Reflective Evidence Controller that monitors evidence sufficiency during generation and adaptively incorporates content from lower-ranked sources using a sliding-window strategy. Experiments on six multimodal QA benchmarks demonstrate that MARA consistently improves retrieval relevance and answer quality over existing state-of-the-art (SOTA) methods.
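
The sliding-window idea behind the Self-Reflective Evidence Controller can be illustrated with a small sketch. This is not the authors' implementation: the function name `sliding_window_evidence`, the `is_sufficient` callback, and the `window` and `max_items` parameters are all hypothetical, and the sufficiency check here is a toy stand-in for whatever signal the model uses during generation. The sketch only shows the contrast with static top-k selection: evidence is consumed from the ranked list one window at a time, and expansion stops as soon as the current set is judged sufficient.

```python
from typing import Callable, List, Sequence


def sliding_window_evidence(
    ranked: Sequence[str],
    is_sufficient: Callable[[List[str]], bool],
    window: int = 2,
    max_items: int = 8,
) -> List[str]:
    """Grow the evidence set one window at a time until it is judged sufficient.

    ranked        -- retrieved items, best first (hypothetical page IDs here)
    is_sufficient -- assumed callback that inspects the current evidence set
    window        -- how many lower-ranked items to add per step
    max_items     -- hard cap on the evidence set size
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    evidence: List[str] = []
    start = 0
    limit = min(len(ranked), max_items)

    while start < limit:
        # Slide the window down the ranking and add the next slice.
        evidence.extend(ranked[start:min(start + window, limit)])
        start += window
        # Stop early once the evidence is judged sufficient.
        if is_sufficient(evidence):
            break

    return evidence


# Toy usage: the answer needs two pages, one of them ranked fourth.
pages = ["p3", "p7", "p1", "p9", "p4"]
needed = {"p3", "p9"}
selected = sliding_window_evidence(
    pages, lambda ev: needed.issubset(ev), window=2
)
print(selected)
```

In this toy run a static top-2 selection would miss `p9`, while the windowed loop reaches it on the second step and then stops without consuming the rest of the ranking.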

Hui Wu, Haoquan Zhai, Yuchen Li, Hengyi Cai, Peirong Zhang, Yidan Zhang, Lei Wang, Chunle Wang, Yingyan Hou, Shuaiqiang Wang, Dawei Yin • 2026

Related benchmarks

Task	Dataset	Result
Chart Question Answering	ChartQA (test)	Accuracy 53.64	196
Document Question Answering	DocVQA (test)	Accuracy 84.64	92
Document Question Answering	SlideVQA (test)	--	19
Document Question Answering	ArxivQA (test)	Accuracy 71.85	14
Document Question Answering	PlotQA (test)	Accuracy 44.57	14
Document Question Answering	InfoVQA (test)	Accuracy 68.02	14
Multimodal Document Retrieval	ArxivQA	MRR 72.02	6
Multimodal Document Retrieval	InfoVQA	MRR 87.93	6
Multimodal Document Retrieval	DocVQA	MRR 83.35	6
Multimodal Document Retrieval	PlotQA	MRR 43.23	6

Showing 10 of 12 rows
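
The retrieval rows above report MRR (mean reciprocal rank). As a reminder of how that metric is computed, here is a minimal sketch of the standard definition: for each query, take the reciprocal of the rank of the first relevant item (0 if none is retrieved), then average over queries. The function name and the toy data are hypothetical, and the multiplication by 100 assumes the table reports MRR on a 0 to 100 scale, which the page does not state.

```python
from typing import List, Sequence, Set


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
) -> float:
    """Average of 1/rank of the first relevant item per query (0 if none found)."""
    if not rankings:
        return 0.0

    total = 0.0
    for ranked, gold in zip(rankings, relevant):
        for position, item in enumerate(ranked, start=1):
            if item in gold:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(rankings)


# Toy data: three queries with the relevant page at rank 1, rank 2, and missing.
rankings: List[List[str]] = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
relevant: List[Set[str]] = [{"a"}, {"a"}, {"x"}]

mrr = mean_reciprocal_rank(rankings, relevant)
print(f"MRR = {mrr:.2f} ({100 * mrr:.2f} on a 0-100 scale)")
```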
