TrackVLA: Embodied Visual Tracking in the Wild

About

Embodied visual tracking is a fundamental skill in Embodied AI, enabling an agent to follow a specific target in dynamic environments using only egocentric vision. This task is inherently challenging as it requires both accurate target recognition and effective trajectory planning under conditions of severe occlusion and high scene dynamics. Existing approaches typically address this challenge through a modular separation of recognition and planning. In this work, we propose TrackVLA, a Vision-Language-Action (VLA) model that learns the synergy between object recognition and trajectory planning. Leveraging a shared LLM backbone, we employ a language modeling head for recognition and an anchor-based diffusion model for trajectory planning. To train TrackVLA, we construct an Embodied Visual Tracking Benchmark (EVT-Bench) and collect recognition samples at diverse difficulty levels, resulting in a dataset of 1.7 million samples. Through extensive experiments in both synthetic and real-world environments, TrackVLA demonstrates state-of-the-art (SOTA) performance and strong generalizability. It significantly outperforms existing methods on public benchmarks in a zero-shot manner while remaining robust to high dynamics and occlusion in real-world scenarios at 10 FPS inference speed. Our project page is: https://pku-epic.github.io/TrackVLA-web.

Shaoan Wang, Jiazhao Zhang, Minghan Li, Jiahang Liu, Anqi Li, Kui Wu, Fangwei Zhong, Junzhi Yu, Zhizheng Zhang, He Wang • 2025

Related benchmarks

Task                     | Dataset                               | Result                    | Rank
Visual Active Tracking   | UnrealCV Parking Lot scene            | EL 467                    | 21
Embodied Visual Tracking | SimpleRoom Unseen Virtual Environment | EL 396                    | 16
Embodied Visual Tracking | UrbanCity Unseen Virtual Environment  | EL 340                    | 16
Embodied Visual Tracking | EVT-Bench Single Target Tracking      | SR 85.1                   | 11
Embodied Visual Tracking | EVT-Bench Distracted Tracking         | SR 57.6                   | 11
Visual Active Tracking   | UnrealCV UrbanRoad scene              | EL 500                    | 11
Visual Active Tracking   | UnrealCV Snow Village scene           | EL 500                    | 11
Visual Active Tracking   | UnrealCV                              | EL 500                    | 11
Embodied Visual Tracking | EVT-Bench                             | ST Success Rate (SR) 85.1 | 10
Visual Active Tracking   | UnrealCV UrbanCity 4D                 | EL 476                    | 10
Showing 10 of 18 rows
