LLMTrack: Semantic Multi-Object Tracking with Multi-modal Large Language Models

About

Multi-Object Tracking (MOT) is evolving from geometric localization to Semantic MOT (SMOT) to answer complex relational queries, yet progress is hindered by semantic data scarcity and a structural disconnect between tracking architectures and Multi-modal Large Language Models (MLLMs). To address this, we introduce Grand-SMOT, a large-scale, open-world benchmark providing high-density, dual-stream narratives that comprehensively decouple individual behaviors from environmental contexts. Furthermore, we propose LLMTrack, the first framework to seamlessly integrate MLLMs into the SMOT task. LLMTrack establishes a Macro-Understanding-First paradigm, utilizing a novel Spatio-Temporal Fusion Module to align discrete geometric trajectories with continuous semantic features, effectively suppressing temporal hallucinations during online processing. Extensive experiments demonstrate that LLMTrack achieves state-of-the-art geometric tracking performance while delivering a qualitative leap in dynamic semantic reasoning. Notably, our analysis reveals that high-quality semantic narratives empower the language model to deduce complex social interactions naturally, demonstrating that direct cognitive reasoning is more effective than cumbersome explicit visual modeling. Ultimately, our contributions bridge the gap between perceptual tracking and cognitive reasoning, establishing a robust new foundation for comprehensive video understanding and intelligent narrative generation.

Pan Liao, Feng Yang, Di Wu, Jinwen Yu, Yuhua Zhu, Wenhui Zhao, Dingwen Zhang• 2026

Related benchmarks

Task	Dataset	Result
Multi-Object Tracking	BenSMOT	HOTA74.61	9
Instance Description	BenSMOT (test)	B-40.525	6
Interaction Recognition	BenSMOT (test)	Precision54.3	6
Video Summary	BenSMOT (test)	B-40.414	6

Showing 4 of 4 rows

Other info

Follow for update

@wizwand_team Discord