Retrieval Augmented Visual Question Answering with Outside Knowledge

About

Outside-Knowledge Visual Question Answering (OK-VQA) is a challenging VQA task that requires retrieval of external knowledge to answer questions about images. Recent OK-VQA systems use Dense Passage Retrieval (DPR) to retrieve documents from external knowledge bases such as Wikipedia, but they train DPR separately from answer generation, which potentially limits overall system performance. Instead, we propose a joint training scheme in which differentiable DPR is integrated with answer generation, so that the system can be trained end to end. Our experiments show that our scheme outperforms recent OK-VQA systems that use strong DPR for retrieval. We also introduce new diagnostic metrics to analyze how retrieval and generation interact. The strong retrieval ability of our model substantially reduces the number of retrieved documents needed in training, which improves answer quality and lowers the computation required for training.
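
To make the idea of training retrieval and generation together more concrete, here is a minimal sketch of one common way to couple the two: a generic marginal-likelihood loss over retrieved documents, in the style of retrieval-augmented generation. It is not taken from the abstract and should not be read as the authors' actual loss. The function names, the inner-product scoring, and the inputs (`query_vec`, `doc_vecs`, `answer_logprobs`) are all assumptions made for illustration.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def joint_nll(query_vec, doc_vecs, answer_logprobs):
    """Negative log marginal likelihood of the answer (illustrative only).

    query_vec:       query embedding from a hypothetical query encoder
    doc_vecs:        embeddings of the K retrieved documents
    answer_logprobs: log p(answer | query, doc_k) from a hypothetical generator
    """
    # Retrieval distribution p(doc_k | query) from inner-product scores.
    log_p_doc = log_softmax([dot(query_vec, d) for d in doc_vecs])
    # log sum_k p(doc_k | query) * p(answer | query, doc_k), via log-sum-exp.
    joint = [lr + la for lr, la in zip(log_p_doc, answer_logprobs)]
    m = max(joint)
    return -(m + math.log(sum(math.exp(j - m) for j in joint)))
```

Because the retrieval scores appear inside the loss, raising the score of a document that helps the generator lowers the loss. In an autodiff framework, that dependence is what would let gradients from answer generation reach the retriever.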

Weizhe Lin, Bill Byrne • 2022

Related benchmarks

Task	Dataset	Metric	Result
Visual Question Answering	Enc-VQA (test)	Single-Hop Accuracy	36.6	84
Visual Question Answering	OK-VQA v1.0 (test)	Accuracy	52.98	77
Visual Question Answering	InfoSeek (Full)	Accuracy	17.2	61
External Knowledge-dependent Image Question Answering	OK-VQA	Accuracy	54.5	49
Visual Question Answering	InfoSeek	Unseen-Q Score	26.1	49
Visual Question Answering	Encyclopedic-VQA Full	Accuracy	34.1	35
Knowledge-based Visual Question Answering	OK-VQA	VQA Score	54.5	32
Visual Question Answering	OK-VQA v1.1 (test)	VQA Score	54.48	28
Visual Question Answering	OK-VQA standard (test)	VQA Accuracy	51.22	19
Visual Question Answering	E-VQA	Accuracy (All)	20	19

