Towards Unified Token Learning for Vision-Language Tracking

About

In this paper, we present a simple, flexible and effective vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task. Traditional paradigms address the VL tracking task indirectly with sophisticated prior designs, which makes them over-specialized to the features of specific architectures or mechanisms. In contrast, our proposed framework serializes the language description and the bounding box into a sequence of discrete tokens. In this new design paradigm, all token queries are required to perceive the desired target and directly predict its spatial coordinates in an auto-regressive manner. Because the design has no other prior modules, it avoids multiple sub-task learning and hand-designed loss functions, significantly reducing the complexity of VL tracking modeling and allowing our tracker to use a simple cross-entropy loss as the unified optimization objective for the VL tracking task. Extensive experiments on the TNL2K, LaSOT, LaSOT_ext and OTB99-Lang benchmarks show that our approach achieves promising results compared to other state-of-the-art methods.
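To make the idea of serializing a bounding box into discrete tokens concrete, here is a minimal sketch of one common way to do it: uniform quantization of normalized coordinates. It is an illustration under stated assumptions, not the authors' implementation. The bin count `NUM_BINS`, the `(x1, y1, x2, y2)` box layout, and the function names `box_to_tokens` and `tokens_to_box` are all hypothetical; the abstract does not specify any of them.

```python
from typing import List, Sequence, Tuple

# Assumed size of the coordinate vocabulary; the abstract does not state it.
NUM_BINS = 1000


def box_to_tokens(box: Sequence[float], img_w: float, img_h: float,
                  num_bins: int = NUM_BINS) -> List[int]:
    """Quantize an (x1, y1, x2, y2) pixel box into four integer tokens."""
    x1, y1, x2, y2 = box
    normalized = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    tokens = []
    for value in normalized:
        value = min(max(value, 0.0), 1.0)  # clamp to the image
        # Map [0, 1] onto bin indices 0 .. num_bins - 1.
        tokens.append(min(int(value * num_bins), num_bins - 1))
    return tokens


def tokens_to_box(tokens: Sequence[int], img_w: float, img_h: float,
                  num_bins: int = NUM_BINS) -> Tuple[float, ...]:
    """Map four tokens back to pixel coordinates using bin centers."""
    scale = (img_w, img_h, img_w, img_h)
    return tuple((t + 0.5) / num_bins * s for t, s in zip(tokens, scale))


tokens = box_to_tokens((100, 50, 300, 200), img_w=400, img_h=400)
recovered = tokens_to_box(tokens, img_w=400, img_h=400)
```

With coordinates expressed as class indices like this, each predicted token can be scored against its ground-truth index with an ordinary cross-entropy loss, which is the kind of unified objective the abstract describes. The round trip loses at most one bin width of precision per coordinate.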

Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, Xianxian Li • 2023

Related benchmarks

Task	Dataset	Result
Object Tracking	LaSoT	AUC 70	498
Visual Object Tracking	LaSOT (test)	AUC 70	470
Visual Object Tracking	TNL2K	AUC 58.6	169
Vision-Language Tracking	OTB 99	AUC 70.5	83
Vision-Language Tracking	TNL2k (test)	AUC 58.6	49
Tracking	OTB99	AUC 0.705	45
Vision-Language Tracking	TNLLT latest (test)	SR 55.8	20
Vision-Language Tracking	LaSOT ext	AUC 0.494	18
Visual Object Tracking	LaSOText ext (test)	AUC 49.4	14
