STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
About
Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by-detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames, and then associate these detections over time. Hence, these methods are often non-end-to-end trainable and highly tailored to specific tasks. In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. In particular, we model a video clip as a single 3D spatio-temporal volume, and propose a novel approach that segments and tracks instances across space and time in a single stage. Our problem formulation is centered around the idea of spatio-temporal embeddings which are trained to cluster pixels belonging to a specific object instance over an entire video clip. To this end, we introduce (i) novel mixing functions that enhance the feature representation of spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that can reason about temporal context. Our network is trained end-to-end to learn spatio-temporal embeddings as well as parameters required to cluster these embeddings, thus simplifying inference. Our method achieves state-of-the-art results across multiple datasets and tasks. Code and models are available at https://github.com/sabarim/STEm-Seg.
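As a rough illustration of clustering pixels over a whole clip, the sketch below assigns every pixel of a small spatio-temporal volume to an instance according to how close its embedding lies to an instance center. It is a minimal toy and not the authors' implementation. The Gaussian form of the membership score, the function name `assign_instance`, the 0.5 threshold, and the hand-built embeddings are all assumptions made for illustration; in the method described above, the embeddings and the clustering parameters are predicted by the network.

```python
import numpy as np

def assign_instance(embeddings, center, variance, threshold=0.5):
    """Return a boolean (T, H, W) mask of pixels assigned to one instance.

    embeddings: (T, H, W, E) per-pixel embeddings for the whole clip.
    center, variance: (E,) hypothetical cluster parameters for one instance.
    The Gaussian score is an assumption for this sketch.
    """
    diff = embeddings - center  # broadcasts over T, H, W
    score = np.exp(-0.5 * np.sum(diff ** 2 / variance, axis=-1))
    return score > threshold

# Toy clip: 3 frames of 4x4 pixels with 2-D embeddings.
T, H, W, E = 3, 4, 4, 2
gt_mask = np.zeros((T, H, W), dtype=bool)
gt_mask[:, :, :2] = True  # the object covers the two left columns in every frame

rng = np.random.default_rng(0)
noise = rng.uniform(-0.05, 0.05, size=(T, H, W, E))
# Object pixels embed near (1, 1); background pixels embed near (5, 5).
embeddings = np.where(gt_mask[..., None], 1.0, 5.0) + noise

mask = assign_instance(
    embeddings,
    center=np.array([1.0, 1.0]),
    variance=np.array([0.25, 0.25]),
)
print(mask.shape, bool(np.array_equal(mask, gt_mask)))
```

Because the score is computed over the full (T, H, W) volume at once, the same cluster yields both the per-frame masks and the association of those masks across frames, which is the sense in which segmentation and tracking happen in a single stage.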
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 61.5 | 1130 |
| Video Instance Segmentation | YouTube-VIS 2019 (val) | AP | 35 | 567 |
| Video Instance Segmentation | YouTube-VIS 2021 (val) | AP | 33.3 | 344 |
| Video Instance Segmentation | OVIS (val) | AP | 13.8 | 301 |
| Video Instance Segmentation | YouTube-VIS (val) | AP | 34.6 | 118 |
| Unsupervised Video Object Segmentation | DAVIS 2016 (val) | F Mean | 80.6 | 108 |
| Multi-Object Tracking and Segmentation | BDD100K segmentation tracking (val) | mMOTSA | 12.2 | 54 |
| Multi-Object Tracking and Segmentation | KITTI MOTS (val) | sMOTSA (Car) | 72.7 | 18 |
| Video Instance Segmentation | OVIS 2021 (val) | AP | 13.8 | 14 |
| Salient region segmentation | DAVIS 2016 (val) | mIoU | 80.6 | 11 |