
TAPTR: Tracking Any Point with Transformers as Detection

About

In this paper, we propose a simple and strong framework for Tracking Any Point with TRansformers (TAPTR). Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP. In the proposed framework, each tracking point in each video frame is represented as a point query, which consists of a positional part and a content part. As in DETR, each query (its position and content feature) is naturally updated layer by layer, and its visibility is predicted from its updated content feature. Queries belonging to the same tracking point can exchange information through self-attention along the temporal dimension. Since all of these operations are well established in DETR-like algorithms, the model is conceptually very simple. We also adopt useful designs such as cost volume from optical flow models, and develop simple mechanisms to provide long-range temporal information while mitigating the feature-drifting issue. Our framework achieves state-of-the-art performance on various TAP datasets with faster inference speed.
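The core idea in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration of our own, not the authors' implementation: each tracked point in each frame is a query with a positional part and a content part; one "layer" refines the position, predicts visibility from the content feature, and lets queries of the same point attend to each other across frames. All class and variable names here are hypothetical.

```python
import torch
import torch.nn as nn

class TemporalQueryLayer(nn.Module):
    """One illustrative decoder layer over TAPTR-style point queries."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        # Self-attention along the temporal dimension: queries of the same
        # tracking point exchange information across frames.
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.pos_head = nn.Linear(dim, 2)  # (dx, dy) position refinement
        self.vis_head = nn.Linear(dim, 1)  # visibility logit from content feature

    def forward(self, content, position):
        # content:  (num_points, num_frames, dim)  content part of the queries
        # position: (num_points, num_frames, 2)    current point estimates
        attn_out, _ = self.temporal_attn(content, content, content)
        content = self.norm(content + attn_out)          # temporal exchange
        position = position + self.pos_head(content)     # layer-by-layer update
        visibility = self.vis_head(content).squeeze(-1)  # per-frame visibility
        return content, position, visibility

# Toy usage: 3 tracked points across 8 frames.
layer = TemporalQueryLayer()
content = torch.randn(3, 8, 64)
position = torch.rand(3, 8, 2)
content, position, visibility = layer(content, position)
print(content.shape, position.shape, visibility.shape)
```

In the actual model such layers would be stacked, with cross-attention to image features (and a cost volume) between the temporal-attention steps; the sketch only shows the query representation and the temporal self-attention.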

Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Lei Zhang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Point Tracking | DAVIS TAP-Vid | Average Jaccard (AJ) | 63 | 41 |
| Point Tracking | DAVIS | AJ | 63 | 38 |
| Point Tracking | TAP-Vid Kinetics | Overall Accuracy | 85.2 | 37 |
| Point Tracking | TAP-Vid-Kinetics (val) | Average Displacement Error | 64.4 | 25 |
| Point Tracking | DAVIS TAP-Vid (val) | AJ | 63 | 19 |
| Point Tracking | TAP-Vid DAVIS (First) | Delta Avg (<c) | 76.1 | 19 |
| Point Tracking | TAP-Vid DAVIS (Strided) | Avg Delta Error | 79.2 | 17 |
| Point Tracking | RGB-Stacking | Average Delta | 76.2 | 13 |
| Point Tracking | RoboTAP | delta_avg | 64.4 | 12 |
| Point Tracking | TAP-Vid-DAVIS ImageNet-C Corruptions 11 | Gaussian Blur Score | 64.1 | 7 |

Showing 10 of 12 rows.
