
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering

About

Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework to tackle KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations in RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate, and (2) relevance scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transforms, using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions using multi-dimensional embeddings to capture finer-grained relevance between queries and documents. FLMR significantly improves the original RA-VQA retriever's PRRecall@5 by approximately 8%. Finally, we equipped RA-VQA with two state-of-the-art large multi-modal/language models to achieve a VQA score of ~61% on the OK-VQA dataset.
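
To make the contrast between one-dimensional and multi-dimensional embeddings concrete, here is a minimal sketch using toy 2-D vectors. It assumes the ColBERT-style "MaxSim" form that late interaction usually refers to (each query embedding is matched to its most similar document embedding and the maxima are summed); the abstract does not spell out the exact formula. The function names are hypothetical, and mean pooling is only a stand-in for collapsing a sequence into a single vector, not a description of how DPR builds its embeddings.

```python
# Illustrative sketch only: toy vectors and hypothetical function names,
# not the authors' implementation.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pool(vecs):
    """Collapse a list of vectors into one vector (assumed stand-in for a
    single-embedding encoder)."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]


def single_vector_score(query_vecs, doc_vecs):
    """Relevance from one embedding per query and one per document."""
    return dot(mean_pool(query_vecs), mean_pool(doc_vecs))


def late_interaction_score(query_vecs, doc_vecs):
    """Relevance from all embeddings: for every query embedding take the best
    matching document embedding, then sum those maxima (MaxSim)."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy query: two question-token embeddings plus one image-derived embedding.
query = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # matches each query direction sharply
doc_b = [[0.5, 0.5], [0.5, 0.5]]  # only a blurred average of both

pooled_a = single_vector_score(query, doc_a)
pooled_b = single_vector_score(query, doc_b)
late_a = late_interaction_score(query, doc_a)
late_b = late_interaction_score(query, doc_b)

print(pooled_a, pooled_b)  # identical: pooling cannot tell the documents apart
print(late_a, late_b)      # doc_a scores higher under late interaction
```

In this toy case both documents pool to the same vector, so the single-vector score ties them, while the token-level score prefers the document whose individual embeddings match individual query embeddings.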

Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, Bill Byrne • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | InfoSeek (test) | Accuracy | 32.1 | 60
Visual Question Answering | E-VQA (test) | Accuracy | 54.5 | 56
Visual Question Answering | OK-VQA v1.1 (test) | VQA Score | 62.08 | 28
Knowledge-based Visual Retrieval | OKVQA Google Search (test) | PR@5 | 70.63 | 16
Multi-modal knowledge base retrieval | ReMuQ (test) | R@5 | 62.76 | 14
Visual Question Answering | OKVQA (test) | Accuracy | 61.9 | 11
Knowledge retrieval | OK-VQA v1.1 (test) | Recall@5 | 89.32 | 10
Knowledge-based Visual Retrieval | ReMuQ 1.0 (test) | MRR@5 | 66.67 | 8
Knowledge-based Visual Question Answering | OKVQA M2KR | VQA Score | 0.6075 | 6
Knowledge-based Visual Retrieval | OKVQA WK11M (test) | MRR@5 | 32.56 | 6
Showing 10 of 16 rows
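
For readers unfamiliar with the retrieval metrics listed here, the sketch below shows one common way to compute Recall@K and MRR@K over ranked retrieval results. It is an assumption-laden illustration: Recall@K is taken as a per-query hit rate (1 if any relevant document appears in the top K), definitions vary between benchmarks, and the function names and toy document IDs are hypothetical rather than taken from any of the listed datasets.

```python
# Illustrative sketch only: hypothetical helpers and toy data.

def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Return 1.0 if any relevant document is in the top k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0


def mrr_at_k(ranked_ids, relevant_ids, k=5):
    """Return the reciprocal rank of the first relevant document in the
    top k, or 0.0 if none appears there."""
    for rank, d in enumerate(ranked_ids[:k], start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0


# Two toy queries: (ranked retrieval output, set of relevant document IDs).
results = [
    (["d3", "d1", "d7"], {"d1"}),                    # relevant doc at rank 2
    (["d2", "d4", "d5", "d6", "d8", "d9"], {"d9"}),  # relevant doc at rank 6
]

mean_recall = sum(recall_at_k(r, rel) for r, rel in results) / len(results)
mean_mrr = sum(mrr_at_k(r, rel) for r, rel in results) / len(results)

print(mean_recall, mean_mrr)
```

Both values are averaged over queries; sites often report them multiplied by 100.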
