SwiftNet: Real-time Video Object Segmentation
About
In this work we present SwiftNet for real-time semi-supervised video object segmentation (one-shot VOS), which reports 77.8% J&F and 70 FPS on the DAVIS 2017 validation set, leading all current solutions in combined accuracy and speed. We achieve this by compressing spatiotemporal redundancy in matching-based VOS via a Pixel-Adaptive Memory (PAM). Temporally, PAM triggers memory updates only on frames where objects display noteworthy inter-frame variation. Spatially, PAM restricts memory update and matching to dynamic pixels while ignoring static ones, significantly reducing computation wasted on segmentation-irrelevant pixels. To promote efficient reference encoding, a light-aggregation encoder deploying reversed sub-pixel is also introduced in SwiftNet. We hope SwiftNet can set a strong and efficient baseline for real-time VOS and facilitate its application in mobile vision. The source code of SwiftNet can be found at https://github.com/haochenheheda/SwiftNet.
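The two PAM gates described above (a temporal gate that skips memory updates on near-static frames, and a spatial gate that writes only at dynamic pixels) can be sketched as follows. This is a minimal NumPy illustration of the idea, not the authors' implementation: the L1 inter-frame difference, `variation_threshold`, and `update_ratio_gate` are all assumed names and heuristics for exposition.

```python
import numpy as np

def dynamic_pixel_mask(prev_frame, curr_frame, variation_threshold=0.1):
    """Mark pixels whose inter-frame variation exceeds a threshold.

    Uses a simple per-pixel L1 difference averaged over channels
    (an assumption; the paper learns which pixels matter).
    """
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
    return diff.mean(axis=-1) > variation_threshold  # (H, W) boolean mask

def maybe_update_memory(memory, curr_features, mask, update_ratio_gate=0.2):
    """Apply PAM-style gating to a per-pixel feature memory.

    Temporal gate: skip the update entirely when too few pixels changed.
    Spatial gate: otherwise, write new features only at dynamic pixels.
    Returns the (possibly unchanged) memory and whether an update fired.
    """
    if mask.mean() < update_ratio_gate:
        return memory, False              # frame barely changed: no update
    updated = memory.copy()
    updated[mask] = curr_features[mask]   # sparse write at dynamic pixels only
    return updated, True

# Toy usage: top half of the frame changes, bottom half is static.
H, W, C = 4, 4, 3
prev = np.zeros((H, W, C))
curr = np.zeros((H, W, C))
curr[:2] = 1.0
mask = dynamic_pixel_mask(prev, curr)
memory, updated = maybe_update_memory(np.zeros((H, W, C)), curr, mask)
```

In this toy run, half the pixels are dynamic, so the temporal gate fires and only the changed top half of the memory is overwritten; the static bottom half keeps its old values, which is where the computational savings come from.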
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J Mean | 78.3 | 1130 |
| Video Object Segmentation | DAVIS 2016 (val) | J Mean | 90.5 | 564 |
| Video Object Segmentation | YouTube-VOS 2018 (val) | J Score (Seen) | 77.8 | 493 |
| Video Object Segmentation | YouTube-VOS 2019 (val) | J Score (Seen) | 77.8 | 231 |
| Semi-supervised Video Object Segmentation | DAVIS 2017 (val) | J&F Score | 81.1 | 31 |
| Semi-supervised Video Object Segmentation | DAVIS 2016 (val) | J Score | 90.5 | 19 |