Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking

About

Replacing hand-wired designs and inductive biases with a general-purpose neural architecture has recently drawn extensive interest. However, existing tracking approaches rely on customized sub-modules and need prior knowledge for architecture selection, which hinders the development of tracking within a more general system. This paper presents a Simplified Tracking architecture (SimTrack) that leverages a transformer backbone for joint feature extraction and interaction. Unlike existing Siamese trackers, we serialize the input images and concatenate them directly before the one-branch backbone. Because feature interaction happens inside the backbone, carefully designed interaction modules can be removed, producing a more efficient and effective framework. To reduce the information loss from down-sampling in vision transformers, we further propose a foveal window strategy, which provides more diverse input patches at acceptable computational cost. SimTrack improves the baseline by 2.5%/2.6% AUC on LaSOT/TNL2K and achieves results competitive with other specialized tracking algorithms without bells and whistles.
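The serialize-and-concatenate step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`patchify`, `center_crop`, `build_token_sequence`), the image sizes, the patch size of 16, and the 64-pixel foveal crop are all assumptions chosen for illustration, and the foveal window is modeled here simply as a center crop of the template that is patchified on its own grid.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches, one flat row per patch."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of patch"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)


def center_crop(image: np.ndarray, size: int) -> np.ndarray:
    """Return the central size x size region of an (H, W, C) image."""
    h, w, _ = image.shape
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]


def build_token_sequence(template: np.ndarray, search: np.ndarray,
                         patch: int = 16, fovea: int = 64) -> np.ndarray:
    """Serialize both images (plus an assumed foveal crop) into one token sequence.

    The concatenated sequence is what a single one-branch transformer backbone
    would consume, so template and search tokens can interact in every layer.
    """
    tokens = [
        patchify(template, patch),                      # template tokens
        patchify(center_crop(template, fovea), patch),  # assumed foveal tokens
        patchify(search, patch),                        # search-region tokens
    ]
    return np.concatenate(tokens, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    template = rng.random((112, 112, 3))  # hypothetical template size
    search = rng.random((224, 224, 3))    # hypothetical search-region size
    print(build_token_sequence(template, search).shape)
```

With these assumed sizes the template contributes 49 tokens, the foveal crop 16, and the search region 196. Because the crop starts at pixel 24, its patch grid is offset from the template's regular grid, which is one way to obtain the "more diverse input patches" the abstract mentions while adding only a few tokens.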

Boyu Chen, Peixia Li, Lei Bai, Lei Qiao, Qiuhong Shen, Bo Li, Weihao Gan, Wei Wu, Wanli Ouyang • 2022

Related benchmarks

Task	Dataset	Result
Visual Object Tracking	TrackingNet (test)	Normalized Precision (Pnorm) 87.4	502
Object Tracking	LaSOT	AUC 70.5	498
Visual Object Tracking	LaSOT (test)	AUC 70.5	470
Visual Object Tracking	GOT-10k (test)	Average Overlap 69.8	450
Object Tracking	TrackingNet	Precision (P) 86.5	327
Visual Object Tracking	GOT-10k	AO 78.9	306
Visual Object Tracking	UAV123	AUC 0.712	193
Visual Object Tracking	UAV123 (test)	AUC 71.2	188
Visual Object Tracking	TNL2K	AUC 55.6	169
Visual Object Tracking	TNL2K (test)	AUC 55.6	92

Showing 10 of 45 rows
