ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe

About

We present ARTrackV2, which integrates two pivotal aspects of tracking: determining where to look (localization) and how to describe (appearance analysis) the target object across video frames. Building on its predecessor, ARTrackV2 introduces a unified generative framework to "read out" the object's trajectory and "retell" its appearance in an autoregressive manner. This approach yields a time-continuous methodology that models the joint evolution of motion and visual features, guided by previous estimates. ARTrackV2 is also simple and efficient: it removes the less efficient intra-frame autoregression and the hand-tuned parameters for appearance updates. Despite its simplicity, ARTrackV2 achieves state-of-the-art performance on prevailing benchmark datasets together with a substantial efficiency improvement. In particular, ARTrackV2 achieves an AO score of 79.5% on GOT-10k and an AUC of 86.1% on TrackingNet while being 3.6× faster than ARTrack. The code will be released.
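To illustrate what "guided by previous estimates" means across frames, here is a minimal sketch of an autoregressive rollout. It is not the ARTrackV2 model. The names `constant_velocity_step` and `autoregressive_rollout`, the `(x, y, width, height)` box format, and the `context` window are assumptions made for illustration. A simple constant-velocity extrapolation stands in for the learned network, and the sketch ignores image content entirely.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x, y, width, height)


def constant_velocity_step(history: Sequence[Box]) -> Box:
    """Hypothetical stand-in for a learned predictor.

    Extrapolates the next box linearly from the last two estimates.
    A real tracker would also condition on the current video frame.
    """
    if len(history) < 2:
        return history[-1]
    prev, last = history[-2], history[-1]
    return (
        2 * last[0] - prev[0],
        2 * last[1] - prev[1],
        2 * last[2] - prev[2],
        2 * last[3] - prev[3],
    )


def autoregressive_rollout(
    initial_boxes: Sequence[Box],
    num_new_frames: int,
    step: Callable[[Sequence[Box]], Box] = constant_velocity_step,
    context: int = 3,
) -> List[Box]:
    """Predict boxes frame by frame, feeding each estimate back as input."""
    if not initial_boxes:
        raise ValueError("at least one initial box is required")
    # Only the most recent `context` estimates are kept as conditioning.
    history = deque(initial_boxes, maxlen=context)
    predictions: List[Box] = []
    for _ in range(num_new_frames):
        box = step(list(history))
        history.append(box)  # the new estimate conditions the next step
        predictions.append(box)
    return predictions
```

The point of the sketch is the feedback loop: each prediction is appended to the history that conditions the next one, so the trajectory is generated one frame at a time rather than estimated independently per frame.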

Yifan Bai, Zeyang Zhao, Yihong Gong, Xing Wei • 2023

Related benchmarks

Task	Dataset	Metric	Result
Visual Object Tracking	TrackingNet (test)	Normalized Precision (Pnorm)	90.4	502
Object Tracking	LaSOT	AUC	73.6	498
Visual Object Tracking	LaSOT (test)	AUC	73.6	470
Visual Object Tracking	GOT-10k (test)	Average Overlap	79.5	450
Object Tracking	TrackingNet	Precision (P)	86.2	327
Visual Object Tracking	GOT-10k	AO	79.5	306
Visual Object Tracking	UAV123	AUC	0.717	193
Visual Object Tracking	UAV123 (test)	AUC	69.9	188
Visual Object Tracking	TNL2K	AUC	61.6	169
Visual Object Tracking	LaSOT_ext	AUC	53.4	140

Showing 10 of 33 rows
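For readers unfamiliar with the Average Overlap (AO) metric in the table, here is a minimal sketch of how it can be computed for a single sequence: the mean intersection-over-union (IoU) between predicted and ground-truth boxes. The function names and the `(x, y, width, height)` box format are assumptions for illustration. The official GOT-10k toolkit may aggregate across sequences and handle special frames differently.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x, y, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Mean IoU over the frames of one sequence (illustrative, not the official toolkit)."""
    if len(preds) != len(gts) or len(preds) == 0:
        raise ValueError("preds and gts must be non-empty and of equal length")
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

Under this definition, an AO of 79.5 (reported as a percentage) corresponds to a mean IoU of 0.795 between predicted and annotated boxes.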
