Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering
About
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework to tackle KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations in RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate, and (2) relevance scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transforms using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions using multi-dimensional embeddings to capture finer-grained relevance between queries and documents. FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by approximately 8%. Finally, we equipped RA-VQA with two state-of-the-art large multi-modal/language models to achieve a VQA score of ~61% on the OK-VQA dataset.
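The contrast between single-embedding scoring and late interaction can be illustrated with a small sketch. The abstract does not spell out the scoring formula, so the code below assumes the max-similarity sum commonly used by late-interaction retrievers (for each query token embedding, take the best-matching document token embedding, then sum). The function names, the mean pooling, and the toy vectors are hypothetical and are not taken from the paper's code. In an FLMR-like setup, the query token list would presumably hold both question token embeddings and image embeddings mapped into the same space.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pool(tokens: Sequence[Vector]) -> List[float]:
    """Collapse token-level embeddings into one vector (hypothetical pooling choice)."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def single_vector_score(query: Vector, doc: Vector) -> float:
    """DPR-style relevance: one dot product between two pooled embeddings."""
    return dot(query, doc)


def late_interaction_score(
    query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]
) -> float:
    """Assumed late-interaction relevance: for every query token embedding,
    keep its maximum similarity over all document token embeddings, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy data: two query tokens, two candidate documents with two tokens each.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # has an exact match for each query token
doc_b = [[0.5, 0.5], [0.5, 0.5]]  # only matches the query "on average"

pooled_query = mean_pool(query_tokens)
pooled_a = single_vector_score(pooled_query, mean_pool(doc_a))
pooled_b = single_vector_score(pooled_query, mean_pool(doc_b))
late_a = late_interaction_score(query_tokens, doc_a)
late_b = late_interaction_score(query_tokens, doc_b)

print(pooled_a, pooled_b)  # identical scores after pooling
print(late_a, late_b)      # token-level scoring separates the two documents
```

In this toy case both documents pool to the same vector, so the single-vector score cannot tell them apart, while the token-level score ranks `doc_a` higher. This is the kind of finer-grained relevance the abstract refers to, shown here only under the stated assumptions.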
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | InfoSeek (test) | Accuracy | 32.1 | 60 |
| Visual Question Answering | E-VQA (test) | Accuracy | 54.5 | 56 |
| Visual Question Answering | OK-VQA v1.1 (test) | VQA Score | 62.08 | 28 |
| Knowledge-based Visual Retrieval | OKVQA Google Search (test) | PR@5 | 70.63 | 16 |
| Multi-modal knowledge base retrieval | ReMuQ (test) | R@5 | 62.76 | 14 |
| Visual Question Answering | OKVQA (test) | Accuracy | 61.9 | 11 |
| Knowledge retrieval | OK-VQA v1.1 (test) | Recall@5 | 89.32 | 10 |
| Knowledge-based Visual Retrieval | ReMuQ 1.0 (test) | MRR@5 | 66.67 | 8 |
| Knowledge-based Visual Question Answering | OKVQA M2KR | VQA Score | 0.6075 | 6 |
| Knowledge-based Visual Retrieval | OKVQA WK11M (test) | MRR@5 | 32.56 | 6 |
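
For readers unfamiliar with the retrieval metrics in the table, the following is a minimal sketch of Recall@K and MRR@K. It assumes one common per-query definition: Recall@K is 1 if any of the top K retrieved documents is relevant, and MRR@K is the reciprocal rank of the first relevant document within the top K (0 if there is none). The benchmarks listed may define relevance or averaging differently, and the function names and toy relevance flags below are hypothetical.

```python
from typing import List, Sequence


def recall_at_k(ranked_relevance: Sequence[bool], k: int) -> float:
    """1.0 if any of the top-k retrieved documents is relevant, else 0.0."""
    return 1.0 if any(ranked_relevance[:k]) else 0.0


def mrr_at_k(ranked_relevance: Sequence[bool], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top-k, else 0.0."""
    for rank, is_relevant in enumerate(ranked_relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean(values: List[float]) -> float:
    """Average a list of per-query scores."""
    return sum(values) / len(values)


# Toy relevance flags for three queries, ordered by retrieval rank.
queries = [
    [False, True, False, False, False],   # first hit at rank 2
    [False, False, False, False, False],  # no hit in the top 5
    [True, False, False, False, False],   # first hit at rank 1
]

avg_recall_at_5 = mean([recall_at_k(q, 5) for q in queries])
avg_mrr_at_5 = mean([mrr_at_k(q, 5) for q in queries])
print(avg_recall_at_5, avg_mrr_at_5)
```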