
Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again

About

Referring Multi-Object Tracking (RMOT) aims to track multiple objects in videos that are specified by natural language expressions. With the recent progress of one-stage methods, the two-stage Referring-by-Tracking (RBT) paradigm has gradually lost popularity. However, its lower training cost and flexible incremental deployment remain irreplaceable. Rethinking existing two-stage RBT frameworks, we identify two fundamental limitations: overly heuristic feature construction and fragile correspondence modeling. To address these issues, we propose FlexHook, a novel two-stage RBT framework. In FlexHook, the proposed Conditioning Hook (C-Hook) redefines feature construction through a sampling-based strategy and language-conditioned cue injection. We then introduce a Pairwise Correspondence Decoder (PCD) that replaces CLIP-based similarity matching with active correspondence modeling, yielding a more flexible and robust strategy. Extensive experiments on multiple benchmarks (Refer-KITTI/v2, Refer-Dance, and LaMOT) show that FlexHook is the first two-stage RBT approach to comprehensively outperform current state-of-the-art methods. Code is available at https://github.com/buptLwz/FlexHook.
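
For context, the kind of similarity matching that the abstract says PCD replaces can be sketched as below. This is a minimal illustration of embedding-similarity matching in general, assuming each track and the expression have already been encoded into vectors of the same length. It is not FlexHook's code, nor the exact procedure of any earlier RBT method. The function names, the 0.5 threshold, and the toy vectors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def referred_tracks(track_embeddings: Dict[int, Sequence[float]],
                    text_embedding: Sequence[float],
                    threshold: float = 0.5) -> List[int]:
    """Return the sorted IDs of tracks whose similarity to the expression
    embedding reaches the (hypothetical) threshold."""
    return sorted(
        track_id
        for track_id, emb in track_embeddings.items()
        if cosine_similarity(emb, text_embedding) >= threshold
    )


# Toy embeddings (made-up values, not produced by any real encoder).
tracks = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
expression = [1.0, 0.0]
selected = referred_tracks(tracks, expression, threshold=0.5)
print(selected)
```

In this sketch the decision for each track depends only on a fixed similarity score and a hand-picked threshold, with no learned interaction between track and expression features. That is one plausible reading of the "fragile correspondence modeling" the abstract refers to.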

Weize Li, Yunhao Du, Qixiang Yin, Zhicheng Zhao, Fei Su • 2025

Related benchmarks

Task                              Dataset                    Result        Rank
Referring Multi-Object Tracking   Refer-KITTI 37 (test)      HOTA 53.83    11
Referring Multi-Object Tracking   Refer-KITTI V2 44 (test)   HOTA 42.53    11
Referring Multi-Object Tracking   LaMOT                      HOTA 56.77    5
Referring Multi-Object Tracking   Refer-Dance                HOTA 32.17    3