Cut and Learn for Unsupervised Object Detection and Instance Segmentation
About
We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models. We leverage the property of self-supervised models to 'discover' objects without supervision and amplify it to train a state-of-the-art localization model without any human labels. CutLER first uses our proposed MaskCut approach to generate coarse masks for multiple objects in an image and then learns a detector on these masks using our robust loss function. We further improve the performance by self-training the model on its predictions. Compared to prior work, CutLER is simpler, compatible with different detection architectures, and detects multiple objects. CutLER is also a zero-shot unsupervised detector and improves detection performance AP50 by over 2.7 times on 11 benchmarks across domains like video frames, paintings, sketches, etc. With finetuning, CutLER serves as a low-shot detector surpassing MoCo-v2 by 7.3% APbox and 6.6% APmask on COCO when training with 5% labels.
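The abstract does not say how MaskCut derives its coarse masks. As a minimal sketch, assume they come from a graph cut over pairwise similarities of self-supervised patch features; this is an illustrative assumption, not something stated above. A single Normalized Cut bipartition could then look like the following. The function name `ncut_bipartition`, the threshold `tau`, and the floor `eps` are hypothetical choices for this example, and the toy features stand in for real backbone outputs.

```python
import numpy as np


def ncut_bipartition(features, tau=0.15, eps=1e-5):
    """Split N feature vectors into two groups with one Normalized Cut step.

    `features` is an (N, D) array, e.g. one row per image patch.
    `tau` and `eps` are illustrative values, not taken from the text above.
    """
    # Cosine similarity between every pair of feature vectors.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T

    # Binarize the affinities; the small floor keeps the graph connected.
    weights = np.where(sim > tau, 1.0, eps)
    degree = weights.sum(axis=1)

    # Solve (D - W) x = lambda * D x via the symmetric normalized Laplacian.
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(len(degree)) - d_inv_sqrt[:, None] * weights * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order

    # The second-smallest eigenvector encodes the two-way partition.
    second = d_inv_sqrt * eigvecs[:, 1]
    return second > second.mean()


# Toy usage: eight "patches" drawn from two clearly different feature directions.
toy_features = np.array([[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4)
print(ncut_bipartition(toy_features))
```

Which side of the split counts as foreground is arbitrary in this sketch, because the sign of an eigenvector is not determined. Detecting multiple objects, as the abstract describes, would need further steps, such as picking the foreground side and repeating the cut on the remaining patches.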
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 35.7 | 2731 |
| Object Detection | COCO 2017 (val) | -- | 2454 |
| Instance Segmentation | COCO 2017 (val) | -- | 1144 |
| Semantic segmentation | ADE20K | mIoU 35.7 | 936 |
| Semantic segmentation | Cityscapes | mIoU 18.7 | 578 |
| Semantic segmentation | Cityscapes (val) | mIoU 18.7 | 572 |
| Video Instance Segmentation | YouTube-VIS 2019 (val) | AP 16 | 567 |
| Semantic segmentation | PASCAL VOC (val) | mIoU 53.8 | 338 |
| Semantic segmentation | PASCAL Context (val) | mIoU 43.4 | 323 |
| 3D Instance Segmentation | ScanNet V2 (val) | Average AP50 0.2 | 195 |