VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
About
Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval-Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across modalities at inference time to produce a coherent final answer. This leads to enhanced accuracy in scenarios where critical information is distributed across modalities and improved answer verifiability through implicit context attribution. Through extensive experiments involving open-source and proprietary large language models, we benchmark state-of-the-art document QA methods on VisDoMBench. The results show that VisDoMRAG outperforms unimodal and long-context LLM baselines for end-to-end multimodal document QA by 12-20%.
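The control flow described in the abstract (two concurrent retrieval-and-reasoning pipelines whose outputs are fused into one answer) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`ModalityOutput`, `run_modality`, `answer_question`, and the toy `overlap_retrieve`, `echo_reason`, and `agree_or_flag` helpers) is hypothetical, and the retrievers, reasoners, and fusion step are plain callables standing in for the retrieval models and LLM prompts the paper would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModalityOutput:
    """Result of one pipeline (textual or visual). Hypothetical structure."""
    evidence: List[str]  # curated evidence kept for this question
    reasoning: str       # chain-of-thought style rationale
    answer: str          # this modality's candidate answer


Retriever = Callable[[str, List[str], int], List[str]]
Reasoner = Callable[[str, List[str]], ModalityOutput]
Fuser = Callable[[str, ModalityOutput, ModalityOutput], str]


def run_modality(question: str, corpus: List[str],
                 retrieve: Retriever, reason: Reasoner, k: int = 2) -> ModalityOutput:
    """Retrieve the top-k items for one modality, then reason over them."""
    evidence = retrieve(question, corpus, k)
    return reason(question, evidence)


def answer_question(question: str, text_chunks: List[str], page_images: List[str],
                    text_retrieve: Retriever, visual_retrieve: Retriever,
                    text_reason: Reasoner, visual_reason: Reasoner,
                    fuse: Fuser, k: int = 2) -> str:
    """Run both pipelines independently, then fuse their outputs into one answer."""
    textual = run_modality(question, text_chunks, text_retrieve, text_reason, k)
    visual = run_modality(question, page_images, visual_retrieve, visual_reason, k)
    return fuse(question, textual, visual)


# Toy stand-ins so the sketch runs without any model (assumptions, not the paper's components).
def overlap_retrieve(question: str, corpus: List[str], k: int) -> List[str]:
    """Rank corpus items by word overlap with the question."""
    q = set(question.lower().split())
    return sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))[:k]


def echo_reason(question: str, evidence: List[str]) -> ModalityOutput:
    """Pretend reasoner: returns the top evidence item as the answer."""
    return ModalityOutput(evidence, f"Top evidence: {evidence[0]}", evidence[0])


def agree_or_flag(question: str, textual: ModalityOutput, visual: ModalityOutput) -> str:
    """Simplified consistency check: accept agreement, flag disagreement."""
    if textual.answer == visual.answer:
        return textual.answer
    return f"UNRESOLVED: {textual.answer} | {visual.answer}"
```

In this sketch, page images are represented as strings purely so the toy retriever can run; a real system would pass image inputs to a visual retriever and a multimodal model. The fusion step here only compares final answers, whereas the abstract describes aligning the reasoning processes themselves, so `agree_or_flag` should be read as a placeholder for that mechanism.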
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Document Question Answering | MMLongBench-Doc | Acc (TXT Evidence) | 29.87 | 30 |
| Document Visual Question Answering | MMLongBench-Doc | Accuracy | 25.96 | 29 |
| Visual Question Answering | SlideVQA | Single Accuracy | 73.79 | 28 |
| Visual Question Answering | ViDoSeek | Single Accuracy | 0.6295 | 14 |
| Video Document Seeking | ViDoSeek | Single Score | 18.76 | 14 |
| Multimodal Document Reasoning | SlideVQA, MMLongBench-Doc, and ViDoSeek | Average Score | 39.57 | 14 |
| Multimodal Document QA | VisDoMBench SPIQA (full) | Accuracy | 75.44 | 11 |
| Multimodal Document QA | VisDoMBench PaperTab (full) | Accuracy | 56.21 | 11 |
| Multimodal Document QA | VisDoMBench SlideVQA (full) | Accuracy | 69.03 | 11 |
| Multimodal Document QA | VisDoMBench SciGraphQA (full) | Accuracy | 63.36 | 11 |