Context-Aware Integration of Language and Visual References for Natural Language Tracking

About

Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame. Existing methods perform language-based and template-based matching for target reasoning separately and then merge the matching results from the two sources. They suffer from tracking drift when the language and visual templates misalign with the dynamic target state, and from ambiguity in the later merging stage. To tackle these issues, we propose a joint multi-modal tracking framework with 1) a prompt modulation module that leverages the complementarity between temporal visual templates and language expressions, enabling precise and context-aware appearance and linguistic cues, and 2) a unified target decoding module that integrates the multi-modal reference cues and executes the integrated queries on the search image to predict the target location directly, in an end-to-end manner. This design ensures spatio-temporal consistency by leveraging historical visual information and provides an integrated solution that generates predictions in a single step. Extensive experiments on TNL2K, OTB-Lang, LaSOT, and RefCOCOg validate the efficacy of the proposed approach, and the results demonstrate competitive performance against state-of-the-art methods for both tracking and grounding.
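The two-stage idea in the abstract (modulate the language and template cues against each other, then run the joint queries on the search image) can be illustrated with a toy sketch. This is not the authors' implementation: the use of plain scaled dot-product cross-attention, the residual update, and the function names `cross_attend`, `modulate`, and `decode_target` are all assumptions made for illustration, and tokens are small hand-written vectors rather than learned features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    return softmax(scores)


def cross_attend(queries, keys, values):
    """Each query becomes an attention-weighted mix of the values."""
    out = []
    for query in queries:
        weights = attention_weights(query, keys)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        out.append(mixed)
    return out


def modulate(language_tokens, template_tokens):
    """Hypothetical prompt modulation: each modality attends to the other,
    and the result is added back as a residual (an assumption)."""
    lang_ctx = cross_attend(language_tokens, template_tokens, template_tokens)
    vis_ctx = cross_attend(template_tokens, language_tokens, language_tokens)
    new_lang = [[a + b for a, b in zip(tok, ctx)]
                for tok, ctx in zip(language_tokens, lang_ctx)]
    new_vis = [[a + b for a, b in zip(tok, ctx)]
               for tok, ctx in zip(template_tokens, vis_ctx)]
    return new_lang, new_vis


def decode_target(language_tokens, template_tokens, search_tokens):
    """Hypothetical unified decoding: the modulated cues form one query set,
    and the most-attended search token is returned as the target index."""
    new_lang, new_vis = modulate(language_tokens, template_tokens)
    queries = new_lang + new_vis
    per_query = [attention_weights(q, search_tokens) for q in queries]
    mean_weights = [sum(col) / len(queries) for col in zip(*per_query)]
    return max(range(len(mean_weights)), key=mean_weights.__getitem__)
```

For example, `decode_target([[1.0, 0.0]], [[0.0, 1.0]], [[-1.0, -1.0], [2.0, 2.0], [0.0, 0.0]])` selects the search token that best agrees with both cues in a single pass, rather than matching each cue separately and merging the results afterwards.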

Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, Jiming Chen • 2024

Related benchmarks

Task	Dataset	Result
Object Tracking	LaSOT	AUC 59.9	498
Vision-Language Tracking	OTB 99	AUC 66.7	83
Vision-Language Tracking	TNL2K	AUC 57.8	25
Natural Language Tracking	TNL-2K	AUC 57.8	19
Natural Language Tracking	OTB Lang	AUC 66.7	17
Visual Grounding	RefCOCOg Google (val)	--	16
Visual Grounding	RefCOCOg UMD (val)	--	8
Visual Grounding	RefCOCOg UMD (test-u)	Average IoU 73.2	4
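
The page does not define its metrics. As a rough guide, and assuming the AUC values in the table above follow the usual success-plot convention of OTB-style tracking benchmarks, the score can be sketched as below. The `(x, y, w, h)` box format, the 21 evenly spaced thresholds, the strict `>` comparison, and the function names are assumptions, not details taken from the paper or this page.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(predicted, ground_truth, num_thresholds=21):
    """Mean success rate over IoU thresholds evenly spaced in [0, 1].

    The success rate at a threshold is the fraction of frames whose
    IoU exceeds it; averaging over thresholds approximates the area
    under the success plot.
    """
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


def average_iou(predicted, ground_truth):
    """Mean IoU over all samples, as a grounding-style summary score."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(overlaps) / len(overlaps)
```

Both functions return a value in [0, 1]; the table appears to report such scores multiplied by 100.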

