iVPT: Improving Task-relevant Information Sharing in Visual Prompt Tuning by Cross-layer Dynamic Connection

About

Recent progress has shown the great potential of visual prompt tuning (VPT) for adapting pre-trained vision transformers to various downstream tasks. However, most existing solutions optimize the prompts at each layer independently, and so neglect the task-relevant information encoded in prompt tokens across layers. In addition, existing prompt structures are prone to interference from task-irrelevant noise in input images, which can harm the sharing of task-relevant information. In this paper, we propose a novel VPT approach, iVPT. It incorporates a cross-layer dynamic connection (CDC) for input prompt tokens from adjacent layers, enabling effective sharing of task-relevant information. Furthermore, we design a dynamic aggregation (DA) module that facilitates selective sharing of information between layers. The combination of CDC and DA enhances the flexibility of the attention process within the VPT framework. Building upon these foundations, iVPT introduces an attentive reinforcement (AR) mechanism that automatically identifies salient image tokens and enhances them with prompt tokens in an additive manner. Extensive experiments on 24 image classification and semantic segmentation benchmarks clearly demonstrate the advantage of the proposed iVPT over state-of-the-art counterparts.
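The abstract gives no formulas, so the following is only a rough sketch of the two ideas it names, not the authors' implementation. It assumes that the cross-layer sharing can be approximated by a sigmoid-gated mix of the previous layer's prompt tokens and the current layer's prompt tokens, and that attentive reinforcement can be approximated by adding the mean prompt token to the top-k image tokens ranked by a given saliency score. The function names, the gate, the saliency scores, and the top-k rule are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def aggregate_prompts(prev_prompts, cur_prompts, gate_logits):
    """Hypothetical stand-in for CDC + DA: per-prompt gated mix of the
    previous layer's prompt tokens and the current layer's prompt tokens.
    gate_logits holds one scalar per prompt (assumed given here; a real
    model would compute or learn it)."""
    mixed = []
    for p_prev, p_cur, g in zip(prev_prompts, cur_prompts, gate_logits):
        a = sigmoid(g)  # share of information taken from the previous layer
        mixed.append([a * x + (1.0 - a) * y for x, y in zip(p_prev, p_cur)])
    return mixed


def reinforce_salient(image_tokens, prompts, saliency, k):
    """Hypothetical stand-in for AR: add the mean prompt token to the k
    image tokens with the highest saliency score; leave the rest as-is."""
    dim = len(prompts[0])
    mean_prompt = [sum(p[d] for p in prompts) / len(prompts) for d in range(dim)]
    ranked = sorted(range(len(image_tokens)), key=lambda i: saliency[i], reverse=True)
    salient = set(ranked[:k])
    return [
        [x + m for x, m in zip(tok, mean_prompt)] if i in salient else list(tok)
        for i, tok in enumerate(image_tokens)
    ]


# Toy usage: one 2-dimensional prompt, three image tokens.
prompts = aggregate_prompts([[2.0, 4.0]], [[0.0, 0.0]], [0.0])
tokens = reinforce_salient([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], prompts, [0.1, 0.9, 0.5], k=1)
```

With a gate logit of 0 the mix weights both layers equally, and only the image token with the highest saliency score receives the additive prompt term.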

Nan Zhou, Jiaxin Chen, Di Huang • 2024

Related benchmarks

Task	Dataset	Result
Diagram Question Answering	AI2D	--	509
Visual Question Answering	GQA	GQA Score 63.26	152
Text-based Visual Question Answering	TextVQA	Score 57.89	112
Visual Question Answering	COCO	Score 65.34	106
Multimodal Visual Perception	MMVP	Accuracy 29.33	106
Multimodal Perception Assessment	MME Perception	MME-P 1.43e+3	77
Science Question Answering	ScienceQA image	Score 68.82	70
Multimodal Question Answering	MMBench CN	Accuracy 57.9	61
Real-world Question Answering	RealworldQA	Overall Score 56.86	58
Multimodal Understanding	Aggregate LLaVA 1.5 Suite	Relative Average Score 58.7	39

Showing 10 of 13 rows
