Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking

About

Exploiting a general-purpose neural architecture to replace hand-wired designs or inductive biases has recently drawn extensive interest. However, existing tracking approaches rely on customized sub-modules and need prior knowledge for architecture selection, hindering the tracking development in a more general system. This paper presents a Simplified Tracking architecture (SimTrack) by leveraging a transformer backbone for joint feature extraction and interaction. Unlike existing Siamese trackers, we serialize the input images and concatenate them directly before the one-branch backbone. Feature interaction in the backbone helps to remove well-designed interaction modules and produce a more efficient and effective framework. To reduce the information loss from down-sampling in vision transformers, we further propose a foveal window strategy, providing more diverse input patches with acceptable computational costs. Our SimTrack improves the baseline with 2.5%/2.6% AUC gains on LaSOT/TNL2K and gets results competitive with other specialized tracking algorithms without bells and whistles.
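The one-branch input described in the abstract can be pictured as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `patchify` and `serialize_pair`, the toy image sizes, and the patch size are all hypothetical, and the linear projection, positional embeddings, foveal window, and the transformer itself are left out. It only shows the template and search images being cut into patch tokens and joined into one sequence before a single backbone would process them.

```python
# Illustrative sketch only: function names and sizes are assumptions,
# not taken from the SimTrack code base.

def patchify(image, patch):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch in row-major order into a single token.
            token = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(token)
    return tokens


def serialize_pair(template, search, patch):
    """Concatenate template tokens and search tokens into one sequence for a single backbone."""
    return patchify(template, patch) + patchify(search, patch)


# Toy single-channel inputs: a 4x4 template and an 8x8 search region.
template = [[0.0] * 4 for _ in range(4)]
search = [[1.0] * 8 for _ in range(8)]
sequence = serialize_pair(template, search, patch=2)
print(len(sequence), len(sequence[0]))
```

With these toy sizes the template contributes 4 tokens and the search region 16, so the joined sequence holds 20 tokens of 4 values each. Because both images share one sequence, self-attention in the backbone can mix template and search features without a separate interaction module, which is the simplification the abstract describes.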

Boyu Chen, Peixia Li, Lei Bai, Lei Qiao, Qiuhong Shen, Bo Li, Weihao Gan, Wei Wu, Wanli Ouyang • 2022

Related benchmarks

Task                   | Dataset            | Metric                       | Result | Rank
Visual Object Tracking | TrackingNet (test) | Normalized Precision (Pnorm) | 87.4   | 460
Visual Object Tracking | LaSOT (test)       | AUC                          | 70.5   | 444
Visual Object Tracking | GOT-10k (test)     | Average Overlap              | 69.8   | 378
Object Tracking        | LaSoT              | AUC                          | 70.5   | 333
Object Tracking        | TrackingNet        | Precision (P)                | 86.5   | 225
Visual Object Tracking | GOT-10k            | AO                           | 78.9   | 223
Visual Object Tracking | UAV123 (test)      | AUC                          | 71.2   | 188
Visual Object Tracking | UAV123             | AUC                          | 0.712  | 165
Visual Object Tracking | TNL2K              | AUC                          | 55.6   | 95
Visual Object Tracking | TNL2k (test)       | AUC                          | 55.6   | 74
Showing 10 of 33 rows
