Improving Visual Object Tracking through Visual Prompting
About
Learning a discriminative model that distinguishes the specified target from surrounding distractors across frames is essential for generic object tracking (GOT). Adapting the target representation dynamically against distractors remains challenging because prevailing trackers have limited discriminative capability. To address this issue, we present PiVOT, a new visual prompting mechanism for generic object tracking. PiVOT leverages a pretrained foundation model (CLIP) to automatically generate and refine visual prompts online, transferring the foundation model's contrastive knowledge to the tracker so that it can suppress distractors. Specifically, a prompt initialization mechanism first produces an initial visual prompt highlighting potential target locations. The foundation model then refines this prompt based on appearance similarities between candidate objects and reference templates across potential targets. After refinement, the visual prompt better highlights potential target locations and carries less irrelevant information. Guided by the visual prompts, which are incrementally and automatically updated during tracking, the tracker generates instance-aware feature maps that effectively suppress distractors. Extensive experiments across multiple benchmarks indicate that PiVOT suppresses distracting objects and improves tracking performance.
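The refinement step described above can be illustrated with a toy sketch. It is not PiVOT's implementation: the function names, the max-over-templates rule, the clamping of negative similarities, and the multiplicative reweighting are all assumptions made for illustration, and the short hand-written vectors merely stand in for foundation-model (for example CLIP) embeddings of candidate objects and reference templates.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def refine_prompt(initial_scores, candidate_feats, template_feats):
    """Reweight initial prompt scores by appearance similarity to the templates.

    Hypothetical rule: each candidate keeps its initial score scaled by its
    best cosine similarity to any reference template, clamped at zero.
    """
    refined = []
    for score, feat in zip(initial_scores, candidate_feats):
        best_sim = max(cosine(feat, t) for t in template_feats)
        refined.append(score * max(best_sim, 0.0))
    return refined


# Three candidate locations from a hypothetical initial prompt.
initial_scores = [0.9, 0.8, 0.3]
# Stand-in embeddings: candidate 0 matches the template, candidate 1 is a
# distractor with orthogonal appearance, candidate 2 is a partial match.
candidate_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
template_feats = [[1.0, 0.0]]

refined = refine_prompt(initial_scores, candidate_feats, template_feats)
print(refined)
```

In this toy case the distractor (candidate 1) starts with a high initial score but is suppressed to zero after refinement, while the true match keeps its score, which mirrors the stated goal of reducing irrelevant prompt information.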
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Object Tracking | TrackingNet (test) | Normalized Precision (Pnorm): 90 | 463 |
| Object Tracking | LaSOT | -- | 411 |
| Visual Object Tracking | GOT-10k (test) | Average Overlap: 76.9 | 408 |
| Object Tracking | TrackingNet | -- | 270 |
| Visual Object Tracking | GOT-10k | AO: 76.9 | 254 |
| Visual Object Tracking | UAV123 (test) | -- | 188 |
| Visual Object Tracking | OTB100 (test) | -- | 41 |
| Visual Object Tracking | AVisT (test) | AUC: 62.2 | 35 |
| Visual Object Tracking | LaSOT 42 (test) | Success Rate: 73.4 | 34 |
| Visual Tracking | AVisT | -- | 33 |