
Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

About

Text in images carries essential information for understanding a scene and performing reasoning. The text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. However, the positional information of the text is underused, and no evidence is provided for the generated answer. To address this, this paper proposes a localization-aware answer prediction network (LaAP-Net). LaAP-Net not only generates an answer to the question but also predicts a bounding box as evidence for that answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. The proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.
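To make the two-output design concrete, below is a minimal sketch of a joint answer-and-localization head. All names, dimensions, and parameters here are hypothetical illustrations, not the authors' implementation: it only shows the shape of the idea, i.e. one fused multimodal feature (a stand-in for the paper's COR-based fusion) feeding both an answer distribution over vocabulary-plus-OCR tokens and a regressed evidence box.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical dimensions (not taken from the paper)
D_FUSED = 64   # size of the fused question/OCR/visual feature
VOCAB = 10     # fixed answer vocabulary size
N_OCR = 5      # number of OCR tokens detected in the image

# Toy parameters for the two prediction heads
W_ans = rng.standard_normal((VOCAB + N_OCR, D_FUSED)) * 0.1  # answer head: vocab + copy-from-OCR
W_box = rng.standard_normal((4, D_FUSED)) * 0.1              # localization head: (x, y, w, h)

def laap_style_predict(fused):
    """Jointly predict an answer distribution and an evidence bounding box.

    The answer is chosen over the fixed vocabulary concatenated with the
    image's OCR tokens; the box head regresses normalized (x, y, w, h)
    coordinates as localization evidence for the chosen answer.
    """
    answer_probs = softmax(W_ans @ fused)
    box = 1.0 / (1.0 + np.exp(-(W_box @ fused)))  # sigmoid -> coordinates in [0, 1]
    return answer_probs, box

fused = rng.standard_normal(D_FUSED)
probs, box = laap_style_predict(fused)
print(probs.shape, box.shape)  # (15,) (4,)
```

The key design point the paper argues for is that the box output is trained alongside the answer, so the model is pushed to ground its answer in a specific image region rather than guessing from the vocabulary alone.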

Wei Han, Hantao Huang, Tao Han · 2020

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA (val) | VQA Score | 41.02 | 309 |
| Visual Question Answering | TextVQA (test) | Accuracy | 41.41 | 124 |
| Visual Question Answering | OCR-VQA (test) | Accuracy | 64.1 | 77 |
| Visual Question Answering | TextVQA v1.0 (val) | Accuracy | 40.68 | 69 |
| Scene Text Visual Question Answering | ST-VQA (val) | ANLS | 0.497 | 30 |
| Visual Question Answering | TextVQA v1.0 (test) | Accuracy | 40.54 | 27 |
| Scene Text Visual Question Answering | ST-VQA (test) | ANLS | 0.485 | 21 |
| Visual Question Answering | OCR-VQA (val) | Accuracy | 63.8 | 17 |
| Visual Question Answering | ST-VQA (test) | ANLS | 48.5 | 15 |
| Scene Text Visual Question Answering | ST-VQA 1.0 (val) | ANLS | 49.7 | 15 |

Showing 10 of 13 rows. (Note: ANLS appears on both a 0–1 and a 0–100 scale in different rows; these are the same metric.)
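The ANLS rows above use Average Normalized Levenshtein Similarity, the standard ST-VQA metric: each prediction scores the best similarity 1 − NL against any reference answer, zeroed when the normalized edit distance NL reaches a threshold (conventionally 0.5), then averaged over questions. A self-contained sketch of that computation:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    """Average Normalized Levenshtein Similarity.

    For each question, take the best score over its reference answers:
    1 - NL(pred, ref) if the normalized distance NL is below tau, else 0.
    """
    total = 0.0
    for pred, refs in zip(predictions, ground_truths):
        best = 0.0
        for ref in refs:
            nl = levenshtein(pred.lower(), ref.lower()) / max(len(pred), len(ref), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)

print(anls(["coca cola"], [["Coca Cola"]]))  # 1.0
```

The threshold keeps near-miss OCR readings (a character or two off) partially credited while scoring unrelated answers as zero, which is why ANLS is preferred over exact-match accuracy for scene-text answers.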
