UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities
About
Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, an any-to-any RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single aggregated corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose modality-aware routing, which dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it, and further justify its effectiveness with a theoretical analysis. Moreover, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 10 benchmarks of multiple modalities, showing its superiority over various modality-specific and unified baselines.
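The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The keyword-based `toy_router`, the `Corpus` class, the token-overlap retriever, and the `(modality, granularity)` labels are all hypothetical stand-ins chosen for illustration; the abstract does not specify how the router or the retrievers are built. The sketch only shows the control flow: pick one modality-specific corpus at a suitable granularity first, then retrieve inside that corpus alone rather than from a single aggregated index.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical route label: (modality, granularity). Names are illustrative only.
Route = Tuple[str, str]


def tokenize(text: str) -> set:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


@dataclass
class Corpus:
    """Toy corpus; real systems would use a modality-specific encoder and index."""
    docs: List[str]

    def search(self, query: str, k: int = 1) -> List[str]:
        # Rank documents by number of tokens shared with the query.
        q = tokenize(query)
        ranked = sorted(self.docs, key=lambda d: len(q & tokenize(d)), reverse=True)
        return ranked[:k]


def toy_router(query: str) -> Route:
    """Assumed keyword heuristic standing in for a learned or prompted router."""
    tokens = tokenize(query)
    if tokens & {"video", "clip", "scene"}:
        modality = "video"
    elif tokens & {"image", "photo", "picture"}:
        modality = "image"
    else:
        modality = "text"
    # Broad questions get the coarse level; everything else gets the fine level.
    granularity = "coarse" if tokens & {"summarize", "overall", "entire"} else "fine"
    return modality, granularity


def universal_retrieve(
    query: str,
    corpora: Dict[Route, Corpus],
    router: Callable[[str], Route],
    k: int = 1,
) -> Tuple[Route, List[str]]:
    """Route first, then retrieve only within the selected corpus."""
    route = router(query)
    return route, corpora[route].search(query, k)


# Text stand-ins for each corpus (captions and transcripts replace real media).
corpora: Dict[Route, Corpus] = {
    ("text", "fine"): Corpus(["Paris is the capital of France.",
                              "Seoul is the capital of South Korea."]),
    ("text", "coarse"): Corpus(["Full article on the history of France."]),
    ("image", "fine"): Corpus(["Caption: photo of the Eiffel Tower at night."]),
    ("image", "coarse"): Corpus(["Album: pictures of Paris landmarks."]),
    ("video", "fine"): Corpus(["Clip transcript: the chef adds salt to the soup."]),
    ("video", "coarse"): Corpus(["Full video transcript: a cooking episode about soup."]),
}

if __name__ == "__main__":
    for q in ["What is the capital of France?",
              "Summarize the entire video about soup"]:
        print(q, "->", universal_retrieve(q, corpora, toy_router))
```

Because each query only touches the corpus its route selects, a text query is never scored against image or video items, which is how routing sidesteps the modality gap that a single shared embedding space would introduce.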
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Question Answering | LVBench | Accuracy | 19.1 | 108 |
| Visual Question Answering | InfoSeek | Accuracy | 23.35 | 77 |
| Document Visual Question Answering | SlideVQA | Accuracy | 0.172 | 53 |
| Multiple-choice Question Answering | MMLU | MMLU Accuracy (Overall) | 74.62 | 52 |
| General Text Question Answering | HotpotQA | Accuracy | 55.9 | 51 |
| Multimodal Retrieval-Augmented Generation | MRAG | Score | 52.55 | 35 |
| Multimodal Question Answering | Open-WikiTable | F1 Recall | 31.12 | 22 |
| Multimodal Question Answering | WebQA | F1-Recall | 79.48 | 22 |
| Multimodal Question Answering | 2WikiMQA | F1-Recall | 47.3 | 22 |
| Visual Question Answering | InfoSeek | F1 Recall | 37.25 | 22 |