Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking

About

In video object tracking, there exist rich temporal contexts among successive frames, which have been largely overlooked in existing trackers. In this work, we bridge the individual video frames and explore the temporal contexts across them via a transformer architecture for robust object tracking. Different from classic usage of the transformer in natural language processing tasks, we separate its encoder and decoder into two parallel branches and carefully design them within the Siamese-like tracking pipelines. The transformer encoder promotes the target templates via attention-based feature reinforcement, which benefits the high-quality tracking model generation. The transformer decoder propagates the tracking cues from previous templates to the current frame, which facilitates the object searching process. Our transformer-assisted tracking framework is neat and trained in an end-to-end manner. With the proposed transformer, a simple Siamese matching approach is able to outperform the current top-performing trackers. By combining our transformer with the recent discriminative tracking pipeline, our method sets several new state-of-the-art records on prevalent tracking benchmarks.

Ning Wang, Wengang Zhou, Jie Wang, Houqaing Li• 2021

Related benchmarks

Task	Dataset	Result
Visual Object Tracking	TrackingNet (test)	Normalized Precision (Pnorm)83.5	502
Object Tracking	LaSoT	AUC63.9	498
Visual Object Tracking	LaSOT (test)	AUC66.5	470
Visual Object Tracking	GOT-10k (test)	Average Overlap68.8	450
Object Tracking	TrackingNet	Precision (P)73.1	327
Visual Object Tracking	GOT-10k	AO68.8	306
Visual Object Tracking	UAV123	AUC0.675	193
Visual Object Tracking	UAV123 (test)	AUC67.5	188
Visual Object Tracking	TNL2K	AUC52.3	169
Visual Object Tracking	OTB-100	AUC71.1	154

Showing 10 of 43 rows

Other info

Code

Follow for update

@wizwand_team Discord