Efficient Visual Tracking via Hierarchical Cross-Attention Transformer

About

In recent years, target tracking has made great progress in accuracy, mainly thanks to powerful networks (such as transformers) and additional modules (such as online update and refinement modules). Tracking speed has received less attention: most state-of-the-art trackers settle for real-time speed on powerful GPUs. Practical applications, however, place higher demands on tracking speed, especially on edge platforms with limited resources. In this work, we present HCAT, an efficient tracking method based on a hierarchical cross-attention transformer. Our model runs at about 195 fps on GPU, 45 fps on CPU, and 55 fps on the NVIDIA Jetson AGX Xavier edge AI platform. Experiments show that HCAT achieves promising results on LaSOT, GOT-10k, TrackingNet, NFS, OTB100, UAV123, and VOT2020. Code and models are available at https://github.com/chenxin-dlut/HCAT.

Xin Chen, Ben Kang, Dong Wang, Dongdong Li, Huchuan Lu • 2022

Related benchmarks

Task	Dataset	Result
Object Tracking	LaSOT	AUC 59.3	519
Visual Object Tracking	GOT-10k	AO 65.1	357
Object Tracking	TrackingNet	Precision (P) 72.9	327
Visual Object Tracking	UAV123 (test)	--	188
Visual Object Tracking	LaSOText	AUC 40.6	140
Visual Object Tracking	OTB100 (test)	Success Rate (IoU>0.50) 68.1	52
Visual Object Tracking	UAV123	SUC 62.7	48
Visual Tracking	NfS (test)	AUC 61.9	45
Visual Object Tracking	AVisT (test)	AUC 41.8	35
Visual Object Tracking	LaSOT 42 (test)	Success Rate 59.3	34

Showing 10 of 29 rows
