
TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

About

Vision Transformers (ViTs) have been widely used in large-scale Vision-and-Language Pre-training (VLP) models. Although previous VLP works have demonstrated the effectiveness of ViTs, these models still suffer from the computational inefficiency caused by long visual sequences. To tackle this problem, this paper proposes an efficient vision-and-language pre-training model with Text-Relevant Image Patch Selection, namely TRIPS, which progressively reduces the visual sequence with a text-guided patch-selection layer in the visual backbone for efficient training and inference. The patch-selection layer dynamically computes text-dependent visual attention to identify the attentive image tokens under text guidance and fuses the inattentive ones in an end-to-end manner. Meanwhile, TRIPS introduces no extra parameters to the ViT. Experimental results on a variety of popular benchmark datasets demonstrate that TRIPS gains a 40% speedup over previous comparable VLP models while achieving competitive or better downstream task performance.

Chaoya Jiang, Haiyang Xu, Chenliang Li, Ming Yan, Wei Ye, Shikun Zhang, Bin Bi, Songfang Huang • 2023
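
The abstract describes the core mechanism (text-dependent attention over image patches, keeping attentive tokens and fusing inattentive ones) but no code is provided on this page. The PyTorch sketch below only illustrates that general idea under simplified assumptions: the function name, the keep_ratio parameter, and the use of a single pooled text vector as the query are illustrative choices, not the authors' implementation, which integrates patch selection inside the ViT layers.

```python
import torch

def text_guided_patch_selection(patch_tokens, text_cls, keep_ratio=0.7):
    """Illustrative sketch of a text-guided patch-selection step.

    patch_tokens: (B, N, D) visual patch embeddings from a ViT layer
    text_cls:     (B, D)    pooled text representation used as the query
    keep_ratio:   fraction of patches kept as "attentive" tokens

    Returns a shorter sequence: the attentive patches plus one fused
    token that aggregates the inattentive ones (no learned parameters).
    """
    B, N, D = patch_tokens.shape
    k = max(1, int(N * keep_ratio))

    # Text-dependent attention: dot product between the text query and
    # every patch token, normalized with softmax over patches.
    scores = torch.einsum("bd,bnd->bn", text_cls, patch_tokens) / D ** 0.5
    attn = scores.softmax(dim=-1)                       # (B, N)

    # Keep the top-k attentive patches.
    topk = attn.topk(k, dim=-1).indices                 # (B, k)
    keep_mask = torch.zeros_like(attn, dtype=torch.bool)
    keep_mask.scatter_(1, topk, True)
    attentive = patch_tokens[keep_mask].view(B, k, D)

    # Fuse the remaining (inattentive) patches into a single token,
    # weighted by their text-conditioned attention.
    inattentive = patch_tokens[~keep_mask].view(B, N - k, D)
    w = attn[~keep_mask].view(B, N - k, 1)
    fused = (w * inattentive).sum(dim=1, keepdim=True) \
        / w.sum(dim=1, keepdim=True).clamp_min(1e-6)

    return torch.cat([attentive, fused], dim=1)         # (B, k + 1, D)


# Example usage with hypothetical shapes (ViT-B/16 on a 224x224 image):
x = torch.randn(2, 196, 768)   # 14x14 image patches
t = torch.randn(2, 768)        # pooled text embedding
out = text_guided_patch_selection(x, t, keep_ratio=0.5)
print(out.shape)               # torch.Size([2, 99, 768])
```

Because the selection reuses attention scores already available in the backbone, shortening the sequence in this way reduces compute without adding parameters, which is consistent with the efficiency claim in the abstract.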

Related benchmarks

Task | Dataset | Metric | Result | Rank
Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy | 74.28 | 346
Natural Language Visual Reasoning | NLVR2 (dev) | Accuracy | 73.27 | 307
Visual Question Answering | VQA (test-dev) | -- | -- | 147
Visual Question Answering | VQA (test-std) | Accuracy | 71.01 | 120
Image Retrieval | MS-COCO | Recall@5 | 66.36 | 69
Text Retrieval | MS-COCO | Recall@1 | 54.52 | 30
