Hierarchical Memory Matching Network for Video Object Segmentation
About
We present Hierarchical Memory Matching Network (HMMN) for semi-supervised video object segmentation. Based on a recent memory-based method [33], we propose two advanced memory read modules that enable us to perform memory reading at multiple scales while exploiting temporal smoothness. We first propose a kernel guided memory matching module that replaces the non-local dense memory read commonly adopted in previous memory-based methods. The module imposes a temporal smoothness constraint on the memory read, leading to accurate memory retrieval. More importantly, we introduce a hierarchical memory matching scheme and propose a top-k guided memory matching module in which the memory read at a fine scale is guided by that at a coarse scale. With this module, we perform memory reads at multiple scales efficiently and leverage both high-level semantic and low-level fine-grained memory features to predict detailed object masks. Our network achieves state-of-the-art performance on the validation sets of DAVIS 2016/2017 (90.8% and 84.7%) and YouTube-VOS 2018/2019 (82.6% and 82.5%), and on the test-dev set of DAVIS 2017 (78.6%). The source code and model are available online: https://github.com/Hongje/HMMN.
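The idea of letting a coarse-scale match guide a fine-scale memory read can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository linked above for that). The function names (`coarse_topk`, `fine_read`), the dot-product similarity, the 1D feature layout, and the fixed scale factor of 2 between coarse and fine positions are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def coarse_topk(q_coarse, k_coarse, k):
    """For each coarse query position, return the indices of its k
    best-matching coarse memory positions (dot-product similarity)."""
    sim = q_coarse @ k_coarse.T               # (Nq_coarse, Nm_coarse)
    return np.argsort(-sim, axis=1)[:, :k]    # (Nq_coarse, k)


def fine_read(q_fine, k_fine, v_fine, topk_idx, scale=2):
    """Fine-scale memory read restricted to candidates under the coarse top-k.

    Assumed 1D layout: coarse position i covers fine positions
    i*scale ... i*scale + scale - 1, for both query and memory.
    """
    n_q = q_fine.shape[0]
    out = np.zeros((n_q, v_fine.shape[1]))
    offsets = np.arange(scale)
    for i in range(n_q):
        coarse_i = i // scale
        # Expand each selected coarse memory index to its fine children.
        cand = (topk_idx[coarse_i][:, None] * scale + offsets).ravel()
        # Attend only over the candidate fine memory positions.
        w = softmax(k_fine[cand] @ q_fine[i])
        out[i] = w @ v_fine[cand]
    return out


# Toy example: 4 coarse / 8 fine query positions, 6 coarse / 12 fine memory positions.
rng = np.random.default_rng(0)
q_c, k_c = rng.normal(size=(4, 16)), rng.normal(size=(6, 16))
q_f, k_f = rng.normal(size=(8, 8)), rng.normal(size=(12, 8))
v_f = rng.normal(size=(12, 5))

idx = coarse_topk(q_c, k_c, k=2)
read = fine_read(q_f, k_f, v_f, idx)   # each fine query attends to 2 * 2 = 4 memory positions
```

In this sketch each fine query position compares against only `k * scale` memory positions instead of all of them, which is where the efficiency gain of a guided read would come from. Setting `k` to the number of coarse memory positions recovers an ordinary dense read.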
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 81.9 | 1130 |
| Video Object Segmentation | DAVIS 2016 (val) | J Mean | 89.6 | 564 |
| Video Object Segmentation | YouTube-VOS 2018 (val) | J Score (Seen) | 82.1 | 493 |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Region J Mean | 74.7 | 237 |
| Video Object Segmentation | YouTube-VOS 2019 (val) | J-Score (Seen) | 81.7 | 231 |
| Video Object Segmentation | DAVIS 2017 (test) | J (Jaccard Index) | 74.7 | 107 |
| Mask Prediction | Youtube-VOS | BCE Loss | 1.567 | 5 |
| Mask Prediction | DAVIS | BCE Loss | 3.738 | 5 |