Learning an Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking
About
Transformer-based models have improved visual tracking, but most still cannot run in real time on resource-limited devices, which is a particular problem for unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we propose AVTrack, an adaptive computation framework that selectively activates transformer blocks through an Activation Module (AM), dynamically optimizing the ViT architecture by engaging only the relevant components. To address extreme viewpoint variations, we propose to learn view-invariant representations via mutual information (MI) maximization. In addition, we propose AVTrack-MD, an enhanced tracker incorporating a novel MI maximization-based multi-teacher knowledge distillation framework. Leveraging multiple off-the-shelf AVTrack models as teachers, we maximize the MI between their aggregated softened features and the corresponding softened feature of the student model, improving the generalization and performance of the student, especially under noisy conditions. Extensive experiments show that AVTrack-MD achieves performance comparable to AVTrack while reducing model complexity and boosting average tracking speed by over 17%. Code is available at: https://github.com/wuyou3474/AVTrack.
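The idea of selectively activating transformer blocks can be illustrated with a minimal sketch. It is not the AVTrack implementation: the linear gate over mean-pooled tokens, the 0.5 threshold, and the names `GatedBlock`, `run_backbone`, and `toy_block` are all assumptions made for illustration, and the abstract does not specify how the Activation Module computes its decision.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_token(tokens):
    """Average a list of equal-length token vectors into one vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


class GatedBlock:
    """A block paired with a hypothetical linear gate that decides whether to run it."""

    def __init__(self, block, gate_weights, gate_bias, threshold=0.5):
        self.block = block
        self.gate_weights = gate_weights
        self.gate_bias = gate_bias
        self.threshold = threshold  # assumed cut-off, not taken from the paper

    def activation_probability(self, tokens):
        pooled = mean_token(tokens)
        logit = sum(w * x for w, x in zip(self.gate_weights, pooled)) + self.gate_bias
        return sigmoid(logit)

    def __call__(self, tokens):
        if self.activation_probability(tokens) >= self.threshold:
            return self.block(tokens), True
        # Gate closed: the block is skipped and tokens pass through unchanged.
        return tokens, False


def run_backbone(gated_blocks, tokens):
    """Run the blocks in order, recording which ones actually executed."""
    executed = []
    for gated_block in gated_blocks:
        tokens, ran = gated_block(tokens)
        executed.append(ran)
    return tokens, executed


def toy_block(tokens):
    """Stand-in for a transformer block: adds 1.0 to every feature."""
    return [[x + 1.0 for x in tok] for tok in tokens]


tokens = [[0.0, 0.0], [2.0, 2.0]]
blocks = [
    GatedBlock(toy_block, [1.0, 1.0], 5.0),    # positive bias: gate opens
    GatedBlock(toy_block, [1.0, 1.0], -50.0),  # large negative bias: gate stays closed
    GatedBlock(toy_block, [1.0, 1.0], 5.0),
]
out, executed = run_backbone(blocks, tokens)
print(executed)
```

In this toy setting the compute saved is proportional to the number of skipped blocks, which is the kind of input-dependent trade-off between accuracy and speed that the abstract describes.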
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Object Tracking | UAV123 (test) | -- | 188 |
| Visual Tracking | UAV123 | -- | 41 |
| Visual Object Tracking | DTB70 (test) | AUC 65 | 19 |
| Visual Object Tracking | UAVTrack112 L (test) | AUC (%) 62.7 | 19 |
| Visual Object Tracking | UAVDT (test) | AUC 58.7 | 19 |
| Visual Object Tracking | UAVTrack112 (test) | AUC 65.4 | 19 |
| UAV Tracking | DTB70 (test) | Precision 84.3 | 15 |
| UAV Tracking | Average (DTB70, UAVDT, UAV123) | AP 83.7 | 15 |
| UAV Tracking | DTB70 | Precision 0.843 | 15 |
| UAV Tracking | VisDrone 2018 | Precision 86 | 15 |