
Visual Prompt Multi-Modal Tracking

About

Visible-modal (RGB) object tracking has given rise to a series of downstream multi-modal tracking tasks. To inherit the powerful representations of the foundation model, a natural approach for multi-modal tracking is full fine-tuning of the RGB-based parameters. Although effective, this approach is suboptimal owing to the scarcity of downstream data and poor transferability. In this paper, inspired by the recent success of prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns modal-relevant prompts to adapt a frozen, pre-trained foundation model to various downstream multi-modal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model pre-trained at scale while introducing only a few trainable parameters (less than 1% of the model's parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks, including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking: ViPT achieves state-of-the-art performance while remaining parameter-efficient. Code and models are available at https://github.com/jiawen-zhu/ViPT.
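To make the idea concrete, the sketch below shows the general prompt-tuning recipe the abstract describes: the RGB foundation model is frozen, and a small trainable module generates prompts from the auxiliary modality (depth, thermal, or event tokens) and injects them into the RGB token stream. This is a minimal illustration in PyTorch, not the authors' implementation; the `PromptBlock` module, its bottleneck width, and the use of a single generic Transformer layer as the "foundation model" are all assumptions for demonstration.

```python
import torch
import torch.nn as nn

class PromptBlock(nn.Module):
    """Lightweight bottleneck adapter (hypothetical) that turns auxiliary-modality
    tokens into prompts and adds them to the RGB tokens."""
    def __init__(self, dim, hidden=8):
        super().__init__()
        self.down = nn.Linear(dim, hidden)  # project aux tokens to a tiny space
        self.up = nn.Linear(hidden, dim)    # project back to the backbone width

    def forward(self, rgb_tokens, aux_tokens):
        # Prompt = low-rank transform of the auxiliary modality, fused by addition.
        return rgb_tokens + self.up(torch.relu(self.down(aux_tokens)))

dim = 256
# Stand-in for the frozen RGB-pretrained foundation model.
backbone = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False  # foundation model stays frozen

prompt = PromptBlock(dim)    # only these parameters would be trained

rgb = torch.randn(2, 64, dim)  # RGB patch tokens (batch, tokens, dim)
aux = torch.randn(2, 64, dim)  # depth / thermal / event patch tokens
out = backbone(prompt(rgb, aux))

trainable = sum(p.numel() for p in prompt.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```

With this toy configuration the trainable prompt parameters are well under 1% of the total, mirroring the parameter-efficiency claim; in ViPT the same principle is applied across the layers of a full tracking backbone.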

Jiawen Zhu, Simiao Lai, Xin Chen, Dong Wang, Huchuan Lu• 2023

Related benchmarks

Task                    Dataset                           Metric          Result   Rank
RGB-D Object Tracking   VOT-RGBD 2022 (public challenge)  EAO             72.3     263
RGB-T Tracking          LasHeR (test)                     PR              65.1     257
RGB-T Tracking          RGBT234 (test)                    Precision Rate  83.6     203
RGB-D Object Tracking   DepthTrack (test)                 Precision       59.2     181
RGB-T Tracking          GTOT                              PR              91.4     138
RGB-T Tracking          RGBT234                           Precision       83.5     121
RGBT Tracking           LasHeR                            PR              65.1     120
RGBT Tracking           RGBT234                           PR              84.7     112
Visual Object Tracking  DepthTrack                        Recall          0.619    91
Object Tracking         VisEvent (test)                   PR              75.8     63

(Showing 10 of 44 rows)
