TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking

About

Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots, and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12 in the egocentric and multi-camera settings, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
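
The two mechanisms named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no details on how the polar token is discretized or how the TIM gate is computed, so the bin counts, the distance cap, the confidence score, and the threshold below are all hypothetical choices made only for illustration. The first function maps a target offset to a single discrete polar token; the second blends a new target feature into memory only when a (hypothetical) visibility confidence is high enough, so the memory is left unchanged during an occlusion.

```python
import math

# Hypothetical sketch: none of these names or constants come from the paper.

def polar_token(x, y, n_angle_bins=8, n_dist_bins=4, max_dist=8.0):
    """Map a target offset (x forward, y left, in metres) to one discrete token id.

    The angle and the (capped) distance are binned separately and then
    combined into a single index in [0, n_angle_bins * n_dist_bins).
    """
    angle = math.atan2(y, x)                 # in [-pi, pi]
    dist = min(math.hypot(x, y), max_dist)   # cap far-away targets
    a_bin = min(int((angle + math.pi) / (2 * math.pi) * n_angle_bins),
                n_angle_bins - 1)
    d_bin = min(int(dist / max_dist * n_dist_bins), n_dist_bins - 1)
    return a_bin * n_dist_bins + d_bin


def gated_update(memory, feature, confidence, threshold=0.5):
    """Blend a new target feature into memory, gated by a confidence score.

    Below the (assumed) threshold the gate is closed and the old memory
    is returned unchanged, which is the behaviour one would want while
    the target is occluded.
    """
    gate = confidence if confidence >= threshold else 0.0
    return [(1.0 - gate) * m + gate * f for m, f in zip(memory, feature)]


if __name__ == "__main__":
    print(polar_token(1.0, 0.0))                        # target straight ahead
    print(gated_update([1.0, 0.0], [0.0, 1.0], 0.2))    # low confidence: memory kept
    print(gated_update([1.0, 0.0], [0.0, 1.0], 1.0))    # high confidence: memory replaced
```

A hard threshold is the simplest possible gate; a learned or soft gate would be an equally plausible reading of "gated update strategy", and the abstract does not say which is used.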

Jiahang Liu, Yunpeng Qi, Jiazhao Zhang, Minghan Li, Shaoan Wang, Kui Wu, Hanjing Ye, Hong Zhang, Zhibo Chen, Fangwei Zhong, Zhizheng Zhang, He Wang • 2025

Related benchmarks

Task	Dataset	Result
Embodied Visual Tracking	EVT-Bench Distracted Tracking	SR 66.5	11
Active visual tracking	EVT-Bench Single Target single-view	TR 81	11
Embodied Visual Tracking	EVT-Bench Single Target Tracking	SR 86	11
Embodied Visual Tracking	EVT-Bench	ST Success Rate (SR) 90.9	10
Person-Following	EVT-Bench Single-Target Tracking (STT) single view	SR 86	9
Person-Following	EVT-Bench single view (Distracted Tracking)	SR 66.5	9
Person-Following	EVT-Bench Ambiguity Tracking (AT) single view	Success Rate (SR) 51.2	8

