MARA: A Multimodal Adaptive Retrieval-Augmented Framework for Document Question Answering
About
Retrieval-based multimodal document QA aims to identify and integrate relevant information from visually rich documents with complex multimodal structures. While retrieval-augmented generation (RAG) has shown strong performance in text-based QA, its extensions to multimodal documents remain underexplored and face significant limitations. Specifically, current approaches rely on query-agnostic document representations that overlook salient content, and they use static top-k evidence selection, which fails to adapt to the uncertain distribution of relevant information. To address these limitations, we propose the Multimodal Adaptive Retrieval-Augmented (MARA) framework, which introduces query-adaptive mechanisms to both retrieval and generation. MARA consists of two components: a Query-Aligned Region Encoder that builds multi-level document representations and reweights them based on query relevance to improve retrieval precision; and a Self-Reflective Evidence Controller that monitors evidence sufficiency during generation and adaptively incorporates content from lower-ranked sources using a sliding-window strategy. Experiments on six multimodal QA benchmarks demonstrate that MARA consistently improves retrieval relevance and answer quality over existing state-of-the-art methods.
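
The two query-adaptive ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the softmax weighting, the `temperature` parameter, and the rule that evidence accumulates across windows are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def query_weighted_score(query: Sequence[float],
                         regions: List[Sequence[float]],
                         temperature: float = 0.1) -> float:
    """Score one page as a softmax-weighted mean of region similarities.

    Assumption: softmax weighting stands in for the query-based
    reweighting described in the abstract; the real form may differ.
    """
    if not regions:
        return 0.0
    sims = [cosine(query, r) for r in regions]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    return sum(s * e / total for s, e in zip(sims, exps))


def gather_evidence(ranked: List[str],
                    is_sufficient: Callable[[List[str]], bool],
                    window: int = 2,
                    max_items: int = 6) -> Tuple[List[str], bool]:
    """Slide a fixed-size window down a ranked list until evidence suffices.

    Assumption: evidence accumulates across windows, and `is_sufficient`
    is a hypothetical stand-in for the model's self-reflection step.
    """
    evidence: List[str] = []
    limit = min(len(ranked), max_items)
    for start in range(0, limit, window):
        evidence.extend(ranked[start:min(start + window, limit)])
        if is_sufficient(evidence):
            return evidence, True
    return evidence, False
```

With a static top-2 cutoff, a relevant item ranked third would never reach the generator. Here, `gather_evidence(["a", "b", "c", "d", "e"], lambda ev: "c" in ev)` judges the first window insufficient, slides to the next one, and returns `(["a", "b", "c", "d"], True)`.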
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Chart Question Answering | ChartQA (test) | Accuracy 53.64 | 190 |
| Document Question Answering | DocVQA (test) | Accuracy 84.64 | 92 |
| Document Question Answering | SlideVQA (test) | -- | 19 |
| Document Question Answering | ArxivQA (test) | Accuracy 71.85 | 14 |
| Document Question Answering | PlotQA (test) | Accuracy 44.57 | 14 |
| Document Question Answering | InfoVQA (test) | Accuracy 68.02 | 14 |
| Multimodal Document Retrieval | ArxiVQA | MRR 72.02 | 6 |
| Multimodal Document Retrieval | InfoVQA | MRR 87.93 | 6 |
| Multimodal Document Retrieval | DocVQA | MRR 83.35 | 6 |
| Multimodal Document Retrieval | PlotQA | MRR 43.23 | 6 |
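
For readers unfamiliar with the retrieval metric in the table, the following sketch computes mean reciprocal rank (MRR) in its standard form. It assumes one gold page per query and that the reported values are MRR scaled by 100; neither is stated on this page, and the function name is illustrative.

```python
from typing import Hashable, List, Sequence


def mean_reciprocal_rank(rankings: List[Sequence[Hashable]],
                         relevant: List[Hashable]) -> float:
    """Average of 1/rank of the gold item per query (0 if it is absent)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, gold in zip(rankings, relevant):
        for position, item in enumerate(ranked, start=1):
            if item == gold:
                total += 1.0 / position
                break
    return total / len(rankings)
```

For example, if the gold page is ranked first for one query and second for another, the MRR is (1 + 0.5) / 2 = 0.75, which would appear as 75.00 under the scaling assumed above.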