Putting the Object Back into Video Object Segmentation
About
We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading, which struggles with matching noise, especially in the presence of distractors, and therefore performs worse on more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Through these queries, it interacts iteratively with the bottom-up pixel features using a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem at a similar running time and by 4.2 J&F over DeAOT while being three times faster. Code is available at: https://hkchengrex.github.io/Cutie
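To make the idea of foreground-background masked attention more concrete, here is a minimal sketch in plain Python. It is not Cutie's implementation: the function name `masked_cross_attention`, the rule that the first half of the queries reads only foreground pixels while the second half reads only background pixels, the single attention head, and the residual update are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_attention(queries, pixels, fg_mask):
    """Update object queries by reading from pixel features under a mask.

    queries: list of Q query vectors (each a list of D floats)
    pixels:  list of N pixel feature vectors (each a list of D floats)
    fg_mask: list of N values, truthy where the pixel is foreground

    Assumption for this sketch: the first half of the queries attends only
    to foreground pixels and the second half only to background pixels.
    """
    dim = len(pixels[0])
    half = len(queries) // 2
    updated = []
    for index, query in enumerate(queries):
        wants_foreground = index < half
        allowed = [i for i, flag in enumerate(fg_mask)
                   if bool(flag) == wants_foreground]
        if not allowed:
            # Fall back to unmasked attention when the mask leaves nothing.
            allowed = list(range(len(pixels)))
        # Scaled dot-product scores against the allowed pixels only.
        scores = [sum(q * p for q, p in zip(query, pixels[i])) / math.sqrt(dim)
                  for i in allowed]
        weights = softmax(scores)
        # Weighted sum of the allowed pixel features.
        readout = [sum(w * pixels[i][d] for w, i in zip(weights, allowed))
                   for d in range(dim)]
        # Residual update of the query with what it read.
        updated.append([q + r for q, r in zip(query, readout)])
    return updated


if __name__ == "__main__":
    pixels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    fg_mask = [1, 1, 0, 0]
    queries = [[0.0, 0.0], [0.0, 0.0]]
    print(masked_cross_attention(queries, pixels, fg_mask))
```

In this toy setup, the foreground query ends up summarizing only the foreground pixel features and the background query only the background ones, which is the separation the abstract attributes to masked attention. In an iterative design, the mask would be re-estimated between rounds; that step is omitted here.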
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 85.6 | 1130 |
| Video Object Segmentation | YouTube-VOS 2019 (val) | J-Score (Seen) | 86.8 | 231 |
| Video Object Segmentation | SA-V (val) | J&F Score | 61.3 | 74 |
| Video Object Segmentation | SA-V (test) | J&F | 62.8 | 70 |
| Video Object Segmentation | MOSE (val) | J&F Score | 71.7 | 45 |
| Video Object Segmentation | LVOS v2 (val) | J&F | 92.2 | 41 |
| Semi-supervised Video Object Segmentation | DAVIS 2017 (val) | J&F Score | 88.1 | 31 |
| Video Object Segmentation | MOSE | J&F Score | 68.3 | 29 |
| Video Object Segmentation | 17 video datasets (EndoVis 2018, ESD, LVOSv2, LV-VIS, UVO, VOST, PUMaVOS, Virtual KITTI 2, VIPSeg, Wildfires, VISOR, FBMS, Ego-Exo4D, Cityscapes, Lindenthal Camera, HT1080WT Cells, and Drosophila Heart) zero-shot | Zero-shot J&F Accuracy | 74.1 | 25 |
| Video Object Segmentation | Hardware Efficiency Benchmark | FPS | 65 | 21 |
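Most rows above report J, or J&F, the mean of J and F. As a rough guide to reading those numbers, the sketch below computes J (region similarity, the intersection-over-union of predicted and ground-truth masks) for a single frame and averages it with a given F value. The boundary measure F is not implemented here, the function names are hypothetical, and returning 1.0 when both masks are empty is an assumed convention; official benchmark toolkits may differ in such details.

```python
def region_similarity(pred_mask, gt_mask):
    """J measure: intersection-over-union of two binary masks (2D lists)."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            p, g = bool(p), bool(g)
            intersection += p and g
            union += p or g
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return intersection / union


def j_and_f(j_value, f_value):
    """J&F: arithmetic mean of region similarity and boundary accuracy."""
    return (j_value + f_value) / 2.0


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    j = region_similarity(pred, gt)
    print(j)                 # 0.5: one shared pixel out of two in the union
    print(j_and_f(j, 1.0))   # 0.75, with an F value of 1.0 supplied by hand
```

The table reports these scores on a 0 to 100 scale, so a per-frame J of 0.5 would correspond to 50.0 before averaging over frames, objects, and videos.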