Instance-level Visual Active Tracking with Occlusion-Aware Planning

About

Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.
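
The first two modules described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that prototypes are the normalized mean of per-view feature vectors, that online enhancement is a confidence-gated exponential moving average, and that "confidence-aware" means the Kalman measurement noise is inflated when detection confidence is low. All function names, parameters, and thresholds here are hypothetical, and plain lists stand in for DINOv3 features.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def build_prototype(view_features):
    """Average L2-normalized per-view features into one unit-length prototype."""
    normed = [l2_normalize(f) for f in view_features]
    dim = len(normed[0])
    mean = [sum(f[i] for f in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def cosine(a, b):
    """Cosine similarity, used to score a candidate against the prototype."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def update_prototype(proto, feature, confidence, momentum=0.9, min_conf=0.5):
    """EMA update of the prototype; skipped when confidence is low.

    The gating rule and default values are assumptions for illustration.
    """
    if confidence < min_conf:
        return proto
    f = l2_normalize(feature)
    mixed = [momentum * p + (1 - momentum) * x for p, x in zip(proto, f)]
    return l2_normalize(mixed)

class ConfidenceKalman1D:
    """Scalar random-walk Kalman filter with confidence-scaled measurement noise."""

    def __init__(self, x0, p0=1.0, q=0.01, r0=0.1):
        self.x, self.p, self.q, self.r0 = x0, p0, q, r0

    def step(self, z, confidence):
        self.p += self.q                      # predict: uncertainty grows
        r = self.r0 / max(confidence, 1e-3)   # low confidence -> noisier measurement
        k = self.p / (self.p + r)             # Kalman gain
        self.x += k * (z - self.x)            # correct the state with the measurement
        self.p *= (1 - k)                     # shrink uncertainty after the update
        return self.x

# Usage: build a prototype from two augmented views, then score two candidates.
proto = build_prototype([[1.0, 0.1], [0.9, 0.2]])
target_score = cosine(proto, [1.0, 0.15])
distractor_score = cosine(proto, [0.1, 1.0])
```

In this sketch a low-confidence detection both leaves the prototype untouched and pulls the filtered position less strongly toward the measurement, which is one plausible way to keep tracking stable under appearance and motion changes.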

Haowei Sun, Kai Zhou, Hao Gao, Shiteng Zhang, Jinwu Hu, Xutao Wen, Qixiang Ye, Mingkui Tan • 2026

Related benchmarks

Task | Dataset | Result | Rank
Visual Active Tracking | UnrealCV Parking Lot scene | EL 482 | 21
Embodied Visual Tracking | SimpleRoom Unseen Virtual Environment | EL 500 | 16
Embodied Visual Tracking | UrbanCity Unseen Virtual Environment | EL 500 | 16
Visual Active Tracking | UnrealCV UrbanRoad scene | EL 500 | 11
Visual Active Tracking | UnrealCV Snow Village scene | EL 500 | 11
Visual Active Tracking | UnrealCV | EL 500 | 11
Visual Active Tracking | UnrealCV UrbanCity 4D | EL 486 | 10
Visual Active Tracking | UnrealCV ComplexRoom 4D | EL 481 | 10
Visual Active Tracking | UnrealCV Average - Distractor Environments | EL 483 | 10
Action Prediction | VOT 2021 (8 selected videos) | Average Correct Action Rate 87.9 | 6
Showing 10 of 18 rows
