Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories
About
Tracking pixels in videos is typically studied as an optical flow estimation problem, where every pixel is described with a displacement vector that locates it in the next frame. Even though wider temporal context is freely available, prior efforts to take it into account have yielded only small gains over 2-frame methods. In this paper, we revisit Sand and Teller's "particle video" approach, and study pixel tracking as a long-range motion estimation problem, where every pixel is described with a trajectory that locates it in multiple future frames. We re-build this classic approach using components that drive the current state-of-the-art in flow and object tracking, such as dense cost maps, iterative optimization, and learned appearance updates. We train our models using long-range amodal point trajectories mined from existing optical flow data that we synthetically augment with multi-frame occlusions. We test our approach on trajectory estimation benchmarks and on keypoint label propagation tasks, and compare favorably against state-of-the-art optical flow and feature tracking methods.
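The core ingredients named above (dense cost maps and iterative position updates) can be illustrated with a minimal sketch. This is not the paper's architecture — there are no learned updates or occlusion handling here, and the function and parameter names are invented for illustration — it only shows the basic loop: correlate a point's appearance feature against a local window of each frame's feature map to get a cost map, then move the estimate to the cost map's soft-argmax.

```python
import numpy as np

def track_point(feats, query, x0, y0, radius=2, iters=4):
    """Toy cost-map tracker (illustrative only, not PIPs).

    feats : (T, H, W, C) per-frame feature maps
    query : (C,) appearance feature of the point in frame 0
    x0,y0 : point location in frame 0
    """
    T, H, W, C = feats.shape
    x, y = float(x0), float(y0)
    traj = [(x, y)]
    for t in range(1, T):
        for _ in range(iters):
            xi, yi = int(round(x)), int(round(y))
            # local window around the current estimate (clipped to bounds)
            xs = np.clip(np.arange(xi - radius, xi + radius + 1), 0, W - 1)
            ys = np.clip(np.arange(yi - radius, yi + radius + 1), 0, H - 1)
            window = feats[t][np.ix_(ys, xs)]      # (2r+1, 2r+1, C)
            cost = window @ query                  # local cost map
            w = np.exp(cost - cost.max())          # softmax weights
            w /= w.sum()
            x = float(w.sum(axis=0) @ xs)          # soft-argmax in x
            y = float(w.sum(axis=1) @ ys)          # soft-argmax in y
        traj.append((x, y))
    return traj
```

A real system would compute features with a CNN, keep a wider temporal window so occluded frames can be inferred from their neighbors, and let a learned module produce the position and appearance updates instead of a fixed soft-argmax.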
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Point Tracking | DAVIS TAP-Vid | Average Jaccard (AJ) | 42 | 41 |
| Point Tracking | DAVIS | AJ | 42 | 38 |
| Point Tracking | TAP-Vid Kinetics | Overall Accuracy | 77.1 | 37 |
| Point Tracking | TAP-Vid RGB-Stacking (test) | AJ | 15.7 | 32 |
| Point Tracking | TAP-Vid DAVIS (test) | AJ | 42 | 31 |
| Point Tracking | TAP-Vid Kinetics (test) | Average Jaccard (AJ) | 35.3 | 30 |
| Point Tracking | Kinetics | delta_avg | 54.8 | 24 |
| Point Tracking | TAP-Vid DAVIS (First) | Delta Avg (<c) | 64.8 | 19 |
| Point Tracking | DAVIS TAP-Vid (val) | AJ | 42 | 19 |
| Point Tracking | Kubric | AJ | 59.1 | 18 |