Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking
About
Exploiting a general-purpose neural architecture to replace hand-wired designs or inductive biases has recently drawn extensive interest. However, existing tracking approaches rely on customized sub-modules and need prior knowledge for architecture selection, which hinders the development of tracking toward a more general system. This paper presents a Simplified Tracking architecture (SimTrack) that leverages a transformer backbone for joint feature extraction and interaction. Unlike existing Siamese trackers, we serialize the input images and concatenate them directly before the one-branch backbone. Because feature interaction happens inside the backbone, the carefully designed interaction modules can be removed, yielding a more efficient and effective framework. To reduce the information loss from down-sampling in vision transformers, we further propose a foveal window strategy, providing more diverse input patches at acceptable computational cost. SimTrack improves the baseline with 2.5%/2.6% AUC gains on LaSOT/TNL2K and achieves results competitive with other specialized tracking algorithms, without bells and whistles.
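The serialize-and-concatenate step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `patchify` and `serialize_pair`, the patch size of 16, and the 112x112 template and 224x224 search image sizes are all assumptions chosen for illustration. The sketch only shows how two images might be turned into one token sequence for a single-branch backbone; it omits the patch embedding, positional encodings, and the foveal window.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def serialize_pair(template: np.ndarray, search: np.ndarray, patch: int = 16):
    """Serialize both images and concatenate them into one token sequence.

    Hypothetical helper: template tokens come first, search tokens second.
    Returns the joint sequence and the number of template tokens.
    """
    t_tokens = patchify(template, patch)
    s_tokens = patchify(search, patch)
    return np.concatenate([t_tokens, s_tokens], axis=0), t_tokens.shape[0]


# Assumed input sizes, for illustration only.
rng = np.random.default_rng(0)
template = rng.random((112, 112, 3), dtype=np.float32)
search = rng.random((224, 224, 3), dtype=np.float32)

tokens, n_template = serialize_pair(template, search)
print(tokens.shape, n_template)
```

With these assumed sizes, the template contributes 7 x 7 = 49 tokens and the search image 14 x 14 = 196, so a backbone receiving the joint sequence can attend across both images in every layer, which is the property the abstract relies on to drop a separate interaction module.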
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Object Tracking | TrackingNet (test) | Normalized Precision (Pnorm) | 87.4 | 460 |
| Visual Object Tracking | LaSOT (test) | AUC | 70.5 | 444 |
| Visual Object Tracking | GOT-10k (test) | Average Overlap | 69.8 | 378 |
| Object Tracking | LaSoT | AUC | 70.5 | 333 |
| Object Tracking | TrackingNet | Precision (P) | 86.5 | 225 |
| Visual Object Tracking | GOT-10k | AO | 78.9 | 223 |
| Visual Object Tracking | UAV123 (test) | AUC | 71.2 | 188 |
| Visual Object Tracking | UAV123 | AUC | 0.712 | 165 |
| Visual Object Tracking | TNL2K | AUC | 55.6 | 95 |
| Visual Object Tracking | TNL2k (test) | AUC | 55.6 | 74 |