Learning What to Learn for Video Object Segmentation
About
Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined during inference with a given first-frame reference mask. The problem of how to capture and utilize this limited target information remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learning module. This internal learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond standard few-shot learning techniques by learning what the few-shot learner should learn. This allows us to achieve a rich internal representation of the target in the current frame, significantly increasing the segmentation accuracy of our approach. We perform extensive experiments on multiple benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a 2.6% relative improvement over the previous best result.
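To make the idea of an internal learner that minimizes a segmentation error in the first frame more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear target model over per-pixel features, a regularized squared-error loss, and a few steepest-descent steps with the exact step length for that quadratic loss. The names `fit_target_model`, `dot`, `reg`, and the synthetic features are all hypothetical. The sketch regresses the raw mask directly and therefore leaves out the part where the network learns what the few-shot learner should learn.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def loss_and_residuals(w, features, labels, reg):
    """Return sum_i (w . x_i - y_i)^2 + reg * |w|^2 and the residuals."""
    residuals = [dot(w, x) - y for x, y in zip(features, labels)]
    return sum(r * r for r in residuals) + reg * dot(w, w), residuals


def fit_target_model(features, labels, num_iters=5, reg=0.01):
    """Fit linear weights on first-frame pixels by steepest descent.

    Every operation is differentiable, so an unrolled loop like this could
    sit inside a larger network trained end to end (an assumption here).
    """
    dim = len(features[0])
    w = [0.0] * dim
    losses = []
    for _ in range(num_iters):
        loss, residuals = loss_and_residuals(w, features, labels, reg)
        losses.append(loss)
        # Gradient of the loss with respect to w.
        grad = [
            2.0 * (sum(r * x[j] for r, x in zip(residuals, features)) + reg * w[j])
            for j in range(dim)
        ]
        # Curvature along the gradient direction: g^T H g with
        # H = 2 * (X^T X + reg * I).
        proj = [dot(grad, x) for x in features]
        curvature = 2.0 * (dot(proj, proj) + reg * dot(grad, grad))
        if curvature == 0.0:
            break  # zero gradient, already at the minimum
        step = dot(grad, grad) / curvature  # exact line search for a quadratic
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    losses.append(loss_and_residuals(w, features, labels, reg)[0])
    return w, losses


# Synthetic stand-in for first-frame features (50 pixels, 4 channels) and mask.
random.seed(0)
first_frame = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
first_mask = [1.0 if x[0] + 0.5 * x[1] > 0 else 0.0 for x in first_frame]

weights, losses = fit_target_model(first_frame, first_mask)

# Apply the fitted target model to features from a later frame.
later_frame = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
scores = [dot(weights, x) for x in later_frame]
```

Because the step length is exact for this quadratic loss, the recorded `losses` cannot increase from one iteration to the next, which is why only a handful of iterations are used in the sketch.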
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 79.1 | 1130 |
| Video Object Segmentation | YouTube-VOS 2018 (val) | J Score (Seen) | 80.4 | 493 |
| Visual Object Tracking | TrackingNet (test) | Normalized Precision (Pnorm) | 84.4 | 460 |
| Visual Object Tracking | LaSOT (test) | AUC | 59.7 | 444 |
| Video Object Segmentation | YouTube-VOS 2019 (val) | J-Score (Seen) | 79.6 | 231 |
| Visual Object Tracking | VOT 2020 (test) | EAO | 0.472 | 147 |
| Visual Tracking | UAV123 | AUC | 59.7 | 41 |
| Video Object Segmentation | LVOS v2 (val) | J&F | 60.6 | 41 |
| Visual Object Tracking | VOT ST 2020 | Robustness | 0.798 | 23 |
| Visual Object Tracking | VOT 2022 | EAO | 51.6 | 14 |