
Visual Prompt Multi-Modal Tracking

About

Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural approach for multi-modal tracking is full fine-tuning of the RGB-based parameters. Although effective, this manner is suboptimal due to the scarcity of downstream data, poor transferability, and other factors. In this paper, inspired by the recent success of prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multi-modal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model that is pre-trained at scale, while introducing only a few trainable parameters (less than 1% of model parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks, including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking, and ViPT achieves state-of-the-art performance while remaining parameter-efficient. Code and models are available at https://github.com/jiawen-zhu/ViPT.
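The core idea above — keep the pre-trained RGB backbone frozen and train only small prompt modules that inject auxiliary-modality (depth/thermal/event) information — can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the class names (`PromptBlock`, `PromptedTracker`), the bottleneck shape, and the toy linear backbone are all assumptions for demonstration; in the real model the frozen backbone is a large vision transformer, which drives the trainable fraction below 1%.

```python
import torch
import torch.nn as nn


class PromptBlock(nn.Module):
    """Hypothetical modal-relevant prompter: a small bottleneck that
    fuses auxiliary-modality tokens into the RGB stream as a residual."""

    def __init__(self, dim: int, hidden: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, hidden)  # project down to a tiny space
        self.up = nn.Linear(hidden, dim)    # project back up

    def forward(self, rgb_tokens: torch.Tensor, aux_tokens: torch.Tensor) -> torch.Tensor:
        # inject auxiliary-modality information as a learned residual prompt
        return rgb_tokens + self.up(torch.relu(self.down(aux_tokens)))


class PromptedTracker(nn.Module):
    """Frozen stand-in backbone + trainable prompt blocks (sketch only)."""

    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        # stand-in for the frozen pre-trained RGB foundation backbone
        self.backbone = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        # the only trainable parameters: one prompt block per stage
        self.prompts = nn.ModuleList(PromptBlock(dim) for _ in range(depth))
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the pre-trained weights

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        x = rgb
        for layer, prompt in zip(self.backbone, self.prompts):
            x = prompt(layer(x), aux)  # only the prompt path receives gradients
        return x


model = PromptedTracker()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
ratio = trainable / total  # small fraction; shrinks further as the backbone grows
```

Because only `model.prompts` requires gradients, the optimizer can be built over `filter(lambda p: p.requires_grad, model.parameters())`, leaving the foundation model's representations intact.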

Jiawen Zhu, Simiao Lai, Xin Chen, Dong Wang, Huchuan Lu • 2023

Related benchmarks

Task                   | Dataset                          | Metric         | Result | Rank
RGB-T Tracking         | LasHeR (test)                    | PR             | 65.1   | 244
RGB-T Tracking         | RGBT234 (test)                   | Precision Rate | 83.6   | 189
RGB-D Object Tracking  | VOT-RGBD 2022 (public challenge) | EAO            | 72.1   | 167
RGB-D Object Tracking  | DepthTrack (test)                | Precision      | 59.2   | 145
RGB-T Tracking         | GTOT                             | PR             | 91.4   | 114
RGB-T Tracking         | RGBT234                          | Precision      | 83.5   | 98
RGBT Tracking          | RGBT234                          | PR             | 83.5   | 65
Object Tracking        | VisEvent (test)                  | PR             | 75.8   | 63
RGBT Tracking          | LasHeR                           | PR             | 65.1   | 55
RGBT Tracking          | RGBT 234                         | Precision Rate | 83.5   | 53

(Showing 10 of 31 rows)
