RPT: Learning Point Set Representation for Siamese Visual Tracking
About
While remarkable progress has been made in robust visual tracking, accurate target state estimation remains a highly challenging problem. In this paper, we argue that this issue is closely related to the prevalent bounding box representation, which provides only a coarse spatial extent of the object. We therefore propose an efficient visual tracking framework that accurately estimates the target state using a finer representation: a set of representative points. The point set is trained to indicate the semantically and geometrically significant positions of the target region, enabling more fine-grained localization and modeling of object appearance. We further propose a multi-level aggregation strategy that obtains detailed structure information by fusing hierarchical convolution layers. Extensive experiments on several challenging benchmarks, including OTB2015, VOT2018, VOT2019 and GOT-10k, demonstrate that our method achieves new state-of-the-art performance while running at over 20 FPS.
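To make the relationship between a point set and a bounding box concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state how the predicted points are turned into a box for benchmark evaluation; the min-max conversion used here is an assumption, chosen because it is a common option for point-set representations. The names `points_to_box` and `box_iou` are hypothetical helpers introduced only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def points_to_box(points: List[Point]) -> Box:
    """Convert a point set to the tightest axis-aligned box around it.

    Assumption: min-max conversion; the paper may use a different rule.
    """
    if not points:
        raise ValueError("point set must not be empty")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted points on a target (pixel coordinates).
predicted_points = [(10.0, 20.0), (30.0, 15.0), (25.0, 40.0)]
predicted_box = points_to_box(predicted_points)
```

The resulting box can then be compared with a ground-truth box using `box_iou`, which is the per-frame overlap that metrics such as Average Overlap are built on. Note that the box discards the finer structure carried by the individual points, which is the limitation the abstract attributes to box-only representations.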
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Object Tracking | GOT-10k (test) | Average Overlap | 62.4 | 378 |
| Visual Object Tracking | VOT 2020 (test) | EAO | 0.53 | 147 |
| Visual Object Tracking | VOT 2019 (test) | Accuracy (A) | 0.623 | 51 |
| Visual Object Tracking | OTB 2015 (test) | AUC Score | 71.5 | 47 |
| Visual Tracking | VOT 2018 | EAO | 0.51 | 9 |