TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

About

In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks. These two tasks aim at reading and understanding scene text in images for question answering and image caption generation, respectively. In contrast to the conventional vision-language pre-training that fails to capture scene text and its relationship with the visual and text modalities, TAP explicitly incorporates scene text (generated from OCR engines) in pre-training. With three pre-training tasks, including masked language modeling (MLM), image-text (contrastive) matching (ITM), and relative (spatial) position prediction (RPP), TAP effectively helps the model learn a better aligned representation among the three modalities: text word, visual object, and scene text. Due to this aligned representation learning, even when pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4%, compared with a non-TAP baseline. To further improve the performance, we build a large-scale dataset based on the Conceptual Caption dataset, named OCR-CC, which contains 1.4 million scene text-related image-text pairs. Pre-trained on this OCR-CC dataset, our approach outperforms the state of the art by large margins on multiple tasks, i.e., +8.3% accuracy on TextVQA, +8.6% accuracy on ST-VQA, and +10.2 CIDEr score on TextCaps.
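
As a rough illustration of what "incorporating scene text in pre-training" can look like for the MLM task, the sketch below joins text words with OCR tokens into one sequence and masks tokens at random, so that a model would have to recover masked words using the other modality. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `mask_tokens`, the `[MASK]` string, and the 15% default masking rate (the common BERT convention) are all hypothetical choices that do not come from the abstract.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption, BERT-style)


def mask_tokens(text_words, ocr_tokens, mask_prob=0.15, seed=0):
    """Mask tokens in the joined text-word + scene-text sequence.

    Returns (masked_sequence, labels). labels[i] holds the original token
    where position i was masked, and None everywhere else.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    sequence = list(text_words) + list(ocr_tokens)
    masked, labels = [], []
    for token in sequence:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(token)  # keep it as the prediction target
        else:
            masked.append(token)
            labels.append(None)   # no loss at unmasked positions
    return masked, labels


# Hypothetical question words and OCR output for one image.
words = ["what", "brand", "is", "the", "laptop"]
ocr = ["dell", "inspiron"]
masked, labels = mask_tokens(words, ocr, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

Because question words and OCR tokens share one sequence, a masked OCR token can in principle be predicted from the question and vice versa, which is the kind of cross-modal alignment the abstract describes.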

Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo • 2020

Related benchmarks

Task	Dataset	Metric	Result
Visual Question Answering	TextVQA (val)	VQA Score	54.71	365
Visual Question Answering	TextVQA (test)	Accuracy	54	124
Image Captioning	TextCaps	CIDEr	99.5	112
Visual Question Answering	TextVQA v1.0 (val)	Accuracy	54.71	84
Image Captioning	TextCaps (val)	CIDEr	119	51
Image Captioning	TextCaps (test)	CIDEr	109.7	50
Visual Question Answering	TextVQA v1.0 (test)	Accuracy	53.97	40
Scene Text Visual Question Answering	ST-VQA (val)	ANLS	0.598	30
Scene Text Visual Question Answering	ST-VQA (test)	ANLS	0.597	21
Visual Question Answering	ST-VQA (test)	ANLS	59.7	15

Showing 10 of 14 rows
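
The ST-VQA rows report ANLS (Average Normalized Levenshtein Similarity), which appears both on a 0 to 1 scale (0.597) and as a percentage (59.7). The sketch below shows how the metric is commonly computed, assuming the usual threshold of 0.5 and case-insensitive comparison. The function names and sample strings are illustrative only, and official evaluation scripts may normalize answers differently.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anls(samples, threshold=0.5):
    """Average over samples of the best per-answer similarity.

    samples is a list of (prediction, [ground_truth_answers]) pairs.
    A similarity counts only if the normalized distance is below threshold.
    """
    total = 0.0
    for prediction, answers in samples:
        pred = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            gt = answer.strip().lower()
            longest = max(len(pred), len(gt))
            nl = levenshtein(pred, gt) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(samples)


# Made-up predictions: exact match, near miss, and a wrong answer.
score = anls([
    ("coca cola", ["Coca Cola"]),
    ("colla", ["cola"]),
    ("cat", ["dog"]),
])
print(round(score, 3), round(100 * score, 1))
```

Multiplying the 0 to 1 score by 100 gives the percentage form, which is why 0.597 and 59.7 in the table describe the same test result.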
