
Visual Prompt Multi-Modal Tracking

About

Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural approach for multi-modal tracking is full fine-tuning of the RGB-based parameters. Although effective, this manner is suboptimal due to the scarcity of downstream data, poor transferability, and other factors. In this paper, inspired by the recent success of prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multi-modal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model that is pre-trained at scale, while introducing only a few trainable parameters (less than 1% of model parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks, including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking, and ViPT achieves state-of-the-art performance while remaining parameter-efficient. Code and models are available at https://github.com/jiawen-zhu/ViPT.
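The core idea above — keep the pre-trained RGB backbone frozen and train only small prompt modules that inject auxiliary-modality (depth/thermal/event) information — can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the class names (`PromptBlock`, `PromptedTracker`), the bottleneck shape, and the toy linear backbone are all assumptions for demonstration; in the real model the frozen backbone is a large vision transformer, which drives the trainable fraction below 1%.

```python
import torch
import torch.nn as nn


class PromptBlock(nn.Module):
    """Hypothetical modal-relevant prompter: a small bottleneck that
    fuses auxiliary-modality tokens into the RGB stream as a residual."""

    def __init__(self, dim: int, hidden: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, hidden)  # project down to a tiny space
        self.up = nn.Linear(hidden, dim)    # project back up

    def forward(self, rgb_tokens: torch.Tensor, aux_tokens: torch.Tensor) -> torch.Tensor:
        # inject auxiliary-modality information as a learned residual prompt
        return rgb_tokens + self.up(torch.relu(self.down(aux_tokens)))


class PromptedTracker(nn.Module):
    """Frozen stand-in backbone + trainable prompt blocks (sketch only)."""

    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        # stand-in for the frozen pre-trained RGB foundation backbone
        self.backbone = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        # the only trainable parameters: one prompt block per stage
        self.prompts = nn.ModuleList(PromptBlock(dim) for _ in range(depth))
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the pre-trained weights

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        x = rgb
        for layer, prompt in zip(self.backbone, self.prompts):
            x = prompt(layer(x), aux)  # only the prompt path receives gradients
        return x


model = PromptedTracker()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
ratio = trainable / total  # small fraction; shrinks further as the backbone grows
```

Because only `model.prompts` requires gradients, the optimizer can be built over `filter(lambda p: p.requires_grad, model.parameters())`, leaving the foundation model's representations intact.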

Jiawen Zhu, Simiao Lai, Xin Chen, Dong Wang, Huchuan Lu • 2023

Related benchmarks

Task                   | Dataset                          | Metric         | Result | Rank
RGB-T Tracking         | LasHeR (test)                    | PR             | 65.1   | 244
RGB-T Tracking         | RGBT234 (test)                   | Precision Rate | 83.6   | 189
RGB-D Object Tracking  | VOT-RGBD 2022 (public challenge) | EAO            | 72.1   | 167
RGB-D Object Tracking  | DepthTrack (test)                | Precision      | 59.2   | 145
RGB-T Tracking         | GTOT                             | PR             | 91.4   | 114
RGB-T Tracking         | RGBT234                          | Precision      | 83.5   | 98
RGBT Tracking          | RGBT234                          | PR             | 83.5   | 65
Object Tracking        | VisEvent (test)                  | PR             | 75.8   | 63
RGBT Tracking          | LasHeR                           | PR             | 65.1   | 55
RGBT Tracking          | RGBT 234                         | Precision Rate | 83.5   | 53

(Showing 10 of 31 rows)
