# Learning Spatio-Temporal Transformer for Visual Tracking

## About
In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding-box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, objects are predicted by a simple fully-convolutional network that directly estimates their corners. The whole method is end-to-end and requires no post-processing steps such as cosine windowing or bounding-box smoothing, largely simplifying existing tracking pipelines. The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks while running at real-time speed, 6x faster than Siam R-CNN. Code and models are open-sourced at https://github.com/researchmm/Stark.
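The pipeline described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the module names (`CornerTracker`, `corner_head`), the toy feature sizes, and the way the decoder's query embedding modulates the search-region features are all simplifying assumptions. It shows the three ingredients the abstract names: an encoder over concatenated template and search tokens, a decoder with a single learned target query, and a fully-convolutional head that predicts the box corners directly via soft-argmax over two heatmaps, with no anchors or proposals.

```python
# Hedged sketch of a STARK-style tracker head (not the official code).
import torch
import torch.nn as nn

class CornerTracker(nn.Module):
    def __init__(self, d_model=64, feat_size=8):  # toy sizes, assumed
        super().__init__()
        self.feat_size = feat_size
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=128, batch_first=True)
        self.query = nn.Embedding(1, d_model)  # one learned target query
        # fully-convolutional corner head: 2 heatmaps (top-left, bottom-right)
        self.corner_head = nn.Sequential(
            nn.Conv2d(d_model, d_model, 3, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, 2, 1))

    def forward(self, template_feat, search_feat):
        # template_feat: (B, N_t, d); search_feat: (B, H*W, d) flattened tokens
        src = torch.cat([template_feat, search_feat], dim=1)
        memory = self.transformer.encoder(src)
        q = self.query.weight.unsqueeze(0).expand(src.size(0), -1, -1)
        dec_out = self.transformer.decoder(q, memory)        # (B, 1, d)
        # modulate search-region tokens by similarity with the query output
        # (a simplification of the paper's similarity-based modulation)
        n = self.feat_size * self.feat_size
        search_mem = memory[:, -n:, :]                       # (B, H*W, d)
        sim = torch.bmm(search_mem, dec_out.transpose(1, 2)) # (B, H*W, 1)
        fmap = (search_mem * sim.sigmoid()).transpose(1, 2).reshape(
            -1, search_mem.size(2), self.feat_size, self.feat_size)
        heatmaps = self.corner_head(fmap)                    # (B, 2, H, W)
        probs = heatmaps.flatten(2).softmax(-1)              # per-corner dist.
        # soft-argmax: expected corner coordinates in [0, feat_size - 1]
        xs = torch.arange(self.feat_size, dtype=torch.float32)
        grid_x = xs.repeat(self.feat_size)
        grid_y = xs.repeat_interleave(self.feat_size)
        cx = (probs * grid_x).sum(-1)                        # (B, 2)
        cy = (probs * grid_y).sum(-1)
        # box = (x_tl, y_tl, x_br, y_br) on the feature-map grid
        return torch.stack([cx[:, 0], cy[:, 0], cx[:, 1], cy[:, 1]], dim=1)
```

Because the corner coordinates come from a softmax-weighted average over the heatmap grid, the box is produced in a single forward pass, with no anchor matching, proposal ranking, or window penalty afterwards.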
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Object Tracking | TrackingNet (test) | Normalized Precision (P_norm) | 86.9 | 460 |
| Visual Object Tracking | LaSOT (test) | AUC | 67.1 | 444 |
| Visual Object Tracking | GOT-10k (test) | Average Overlap | 71.5 | 378 |
| Object Tracking | LaSOT | AUC | 67.1 | 333 |
| RGB-T Tracking | LasHeR (test) | PR | 44.9 | 244 |
| Object Tracking | TrackingNet | Precision (P) | 78.1 | 225 |
| Visual Object Tracking | GOT-10k | AO | 78.1 | 223 |
| RGB-T Tracking | RGBT234 (test) | Precision Rate | 79.0 | 189 |
| Visual Object Tracking | UAV123 (test) | AUC | 69.2 | 188 |
| RGB-D Object Tracking | VOT-RGBD 2022 (public challenge) | EAO | 64.7 | 167 |