TAPNext: Tracking Any Point (TAP) as Next Token Prediction
About
Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency and removes the temporal windowing required by many state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training. The TAPNext model and code can be found at https://tap-next.github.io/.
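To make the framing concrete, the loop below is a minimal, hypothetical sketch of point tracking as online, causal, per-frame token decoding. `CausalTracker` and its `step` method are illustrative stand-ins, not the actual TAPNext architecture: the idea shown is only that the tracker consumes one frame at a time, carries causal state across frames (no temporal window), and decodes a location and visibility prediction for the query point at every step.

```python
import numpy as np

class CausalTracker:
    """Illustrative stand-in for a causal per-frame point tracker.

    A real model would decode learned location/visibility tokens from
    image features; here `step` just echoes the query to show the
    online interface: one frame in, one prediction out.
    """

    def __init__(self):
        self.state = None  # causal context carried across frames

    def step(self, frame, query_xy):
        # Pretend to update the causal context from the current frame only.
        self.state = float(frame.mean())
        x, y = query_xy
        visible = True  # placeholder visibility decoding
        return (x, y), visible


def track_online(frames, query_xy):
    """Purely online tracking: process frames sequentially, never look ahead."""
    tracker = CausalTracker()
    trajectory = []
    for frame in frames:
        xy, vis = tracker.step(frame, query_xy)
        trajectory.append((xy, vis))  # one prediction per frame, minimal latency
    return trajectory


video = [np.zeros((256, 256, 3)) for _ in range(4)]
traj = track_online(video, query_xy=(128.0, 64.0))
print(len(traj))  # one (location, visibility) pair per frame
```

Because the model is causal, each prediction depends only on frames seen so far, which is what lets an approach like this run frame-by-frame without the sliding temporal windows used by many offline trackers.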
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Point Tracking | TAP-Vid DAVIS (First) | δ_avg | 76.6 | 76 |
| Point Tracking | TAP-Vid Kinetics (First) | δ_avg | 64.46 | 53 |
| Point Tracking | TAP-Vid DAVIS | Average Jaccard (AJ) | 65.2 | 52 |
| Point Tracking | TAP-Vid Kinetics | Overall Accuracy | 90.06 | 48 |
| Point Tracking | TAP-Vid DAVIS (Strided) | δ_avg | 79.7 | 33 |
| Point Tracking | RoboTAP | Average Jaccard (AJ) | 59.5 | 22 |
| Point Tracking | TAP-Vid Kubric (subset of 30 videos) | Average Jaccard (AJ) | 80.91 | 12 |
| Point Tracking | EgoPoints | δ_avg | 31.8 | 10 |
| Point Tracking | Dynamic Replica | δ_avg | 46.2 | 9 |
| Point Tracking | RoboTAP (First) | Average Jaccard (AJ) | 59.8 | 8 |