Bootstrapping Referring Multi-Object Tracking

About

Referring understanding is a fundamental task that bridges natural language and visual content by localizing objects described in free-form expressions. However, existing works are constrained by limited language expressiveness, lacking the capacity to model object dynamics in spatial numbers and temporal states. To address these limitations, we introduce a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking, comprehensively accounting for variations in object quantity and temporal semantics. Along with RMOT, we introduce an RMOT benchmark named Refer-KITTI-V2, featuring scalable and diverse language expressions. To efficiently generate high-quality annotations covering object dynamics with minimal manual effort, we propose a semi-automatic labeling pipeline that formulates a total of 9,758 language prompts. In addition, we propose TempRMOT, an elegant end-to-end Transformer-based framework for RMOT. At its core is a query-driven Temporal Enhancement Module that represents each object as a Transformer query, enabling long-term spatial-temporal interactions with other objects and past frames to efficiently refine these queries. TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2, demonstrating the effectiveness of our approach. The source code and dataset are available at https://github.com/zyn213/TempRMOT.

Yani Zhang, Dongming Wu, Wencheng Han, Xingping Dong • 2024
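
The abstract describes each object as a Transformer query that is refined through interactions with past frames. The sketch below illustrates only that general idea, assuming plain scaled dot-product attention with a residual connection. It is not the authors' Temporal Enhancement Module: the function names (`temporal_refine`, `softmax`, `dot`) are hypothetical, and learned projections, multi-head attention, and language conditioning are all left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def temporal_refine(current_queries, memory_queries):
    """Refine current-frame object queries by attending over past-frame queries.

    Hypothetical illustration: each current query attends over a flat list of
    queries collected from earlier frames, and the attended context is added
    back to the query (residual connection).
    """
    dim = len(current_queries[0])
    scale = 1.0 / math.sqrt(dim)
    refined = []
    for query in current_queries:
        # Attention weights of this query over every stored past query.
        weights = softmax([dot(query, past) * scale for past in memory_queries])
        # Weighted average of the past queries, one value per feature dimension.
        context = [
            sum(w * past[i] for w, past in zip(weights, memory_queries))
            for i in range(dim)
        ]
        refined.append([q + c for q, c in zip(query, context)])
    return refined


# Two objects in the current frame, three stored queries from past frames.
current = [[1.0, 0.0], [0.0, 1.0]]
memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
refined = temporal_refine(current, memory)
```

Each refined query keeps its own identity through the residual term while absorbing a similarity-weighted summary of the past queries. This is the simplest form of the "interaction with past frames" the abstract mentions.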

Related benchmarks

Task	Dataset	Result
Referring Multi-Object Tracking	Refer-KITTI	HOTA 68.7	19
Referring Multi-Object Tracking	Refer-KITTI 37 (test)	HOTA 52.21	11
Referring Multi-Object Tracking	Refer-KITTI V2 44 (test)	HOTA 35.04	11
Referring Multi-Object Tracking	Refer-KITTI V2	HOTA 35.04	8
Referring Multi-Object Tracking	ReaMOT Challenge Low-Level Perception	RHOTA 12.95	7
Referring Multi-Object Tracking	ReaMOT Challenge High-Level Reasoning	RHOTA 6.58	7
Referring Multi-Object Tracking	MeViS v2	HOTA*30	4
RGBD Referring Multi-Object Tracking	DRSet (test)	HOTA237	4
Referring Multi-Object Tracking	ORSet (test)	HOTA2	3
