Simple Cues Lead to a Strong Multi-Object Tracker
About
For a long time, the most common paradigm in Multi-Object Tracking was tracking-by-detection (TbD), where objects are first detected and then associated over video frames. For association, most models resorted to motion and appearance cues, e.g., re-identification networks. Recent attention-based approaches propose to learn these cues in a data-driven manner, showing impressive results. In this paper, we ask whether simple, good old TbD methods are also capable of matching the performance of end-to-end models. To this end, we propose two key ingredients that allow a standard re-identification network to excel at appearance-based tracking. We extensively analyse its failure cases and show that combining our appearance features with a simple motion model leads to strong tracking results. Our tracker generalizes to four public datasets, namely MOT17, MOT20, BDD100k, and DanceTrack, achieving state-of-the-art performance. Code: https://github.com/dvl-tum/GHOST.
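The association step described above can be illustrated with a minimal sketch. This is not the paper's GHOST implementation: the cost weighting, the greedy matching (papers typically use the Hungarian algorithm), and all names (`associate`, `w_app`, `max_cost`) are illustrative assumptions. It only shows the general TbD idea of fusing an appearance cue (embedding distance) with a motion cue (box overlap) into one matching cost.

```python
import math

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def cosine_dist(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (nu * nv + 1e-9)

def associate(tracks, detections, w_app=0.5, max_cost=0.7):
    """Match existing tracks to new detections.

    tracks/detections: dicts with "box" (x1, y1, x2, y2) and "emb"
    (appearance embedding). Cost fuses appearance distance with a
    motion term (1 - IoU); pairs above max_cost are left unmatched.
    Greedy matching stands in for the usual Hungarian assignment.
    """
    pairs = []
    for ti, t in enumerate(tracks):
        for di, d in enumerate(detections):
            cost = (w_app * cosine_dist(t["emb"], d["emb"])
                    + (1.0 - w_app) * (1.0 - iou(t["box"], d["box"])))
            pairs.append((cost, ti, di))
    pairs.sort()  # cheapest pairs first
    matched_t, matched_d, matches = set(), set(), []
    for cost, ti, di in pairs:
        if cost > max_cost or ti in matched_t or di in matched_d:
            continue
        matched_t.add(ti)
        matched_d.add(di)
        matches.append((ti, di))
    return matches
```

A real tracker would run this per frame, spawning new tracks for unmatched detections and keeping unmatched tracks alive for a few frames; the paper's contribution lies in how the appearance features are extracted, not in the matching machinery itself.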
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multiple Object Tracking | MOT17 (test) | MOTA 78.9 | 921 |
| Multiple Object Tracking | MOT20 (test) | MOTA 73.7 | 358 |
| Multi-Object Tracking | DanceTrack (test) | HOTA 0.567 | 355 |
| Multi-Object Tracking | BDD100K (val) | mIDF1 55.6 | 70 |
| Multi-Object Tracking | MOT17 | MOTA 78.7 | 55 |
| Multi-Object Tracking | MOT 2020 (test) | MOTA 73.7 | 44 |
| Multi-Object Tracking | BDD100K (test) | Mean IDF1 57 | 36 |
| Multi-Object Tracking | MOT 2017 (test) | MOTA 78.7 | 34 |
| Multiple Object Tracking | MOT20 | MOTA 73.7 | 21 |