Towards Unified Token Learning for Vision-Language Tracking

About

In this paper, we present a simple, flexible and effective vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task. Traditional paradigms address the VL tracking task indirectly with sophisticated prior designs, which makes them over-specialized to the features of specific architectures or mechanisms. In contrast, our proposed framework serializes the language description and bounding box into a sequence of discrete tokens. In this new design paradigm, all token queries are required to perceive the desired target and directly predict its spatial coordinates in an auto-regressive manner. Because the design has no other prior modules, it avoids multi-sub-task learning and hand-designed loss functions, significantly reducing the complexity of VL tracking modeling and allowing our tracker to use a simple cross-entropy loss as the unified optimization objective for the VL tracking task. Extensive experiments on the TNL2K, LaSOT, LaSOT_ext and OTB99-Lang benchmarks show that our approach achieves promising results compared with other state-of-the-art methods.
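The token formulation described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a box normalized to [0, 1], a hypothetical vocabulary of 1000 coordinate bins, and illustrative function names (box_to_tokens, tokens_to_box, cross_entropy, sequence_loss). It shows only the bounding-box side: coordinates are quantized into discrete tokens, and the training objective is an ordinary cross-entropy over those tokens.

```python
import math

NUM_BINS = 1000  # assumed number of coordinate bins (hypothetical, not from the paper)


def box_to_tokens(box, num_bins=NUM_BINS):
    """Quantize a normalized [x1, y1, x2, y2] box (values in [0, 1]) into integer tokens."""
    tokens = []
    for v in box:
        t = int(round(v * (num_bins - 1)))
        tokens.append(min(num_bins - 1, max(0, t)))  # clamp to the valid token range
    return tokens


def tokens_to_box(tokens, num_bins=NUM_BINS):
    """Map integer tokens back to normalized coordinates (inverse of box_to_tokens up to bin width)."""
    return [t / (num_bins - 1) for t in tokens]


def cross_entropy(logits, target):
    """Cross-entropy for one token: log-sum-exp of the logits minus the target logit."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


def sequence_loss(step_logits, target_tokens):
    """Average cross-entropy over the coordinate tokens of one box."""
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, target_tokens)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    box = [0.10, 0.20, 0.55, 0.80]
    tokens = box_to_tokens(box)
    print(tokens, tokens_to_box(tokens))
```

In an auto-regressive decoder, the logits for each coordinate would be conditioned on the visual features, the language tokens and the coordinates already generated. The sketch leaves that model out and covers only the serialization and the loss.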

Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, Xianxian Li • 2023

Related benchmarks

Task                      Dataset                Result      Rank
Object Tracking           LaSoT                  AUC 70      498
Visual Object Tracking    LaSOT (test)           AUC 70      470
Visual Object Tracking    TNL2K                  AUC 58.6    169
Vision-Language Tracking  OTB 99                 AUC 70.5    83
Vision-Language Tracking  TNL2k (test)           AUC 58.6    49
Tracking                  OTB99                  AUC 0.705   45
Vision-Language Tracking  TNLLT latest (test)    SR 55.8     20
Vision-Language Tracking  LaSOT ext              AUC 0.494   18
Visual Object Tracking    LaSOText ext (test)    AUC 49.4    14
