Single-Stage Semantic Segmentation from Image Labels
About
Recent years have seen a rapid growth in new approaches improving the accuracy of semantic segmentation in a weakly supervised setting, i.e. with only image-level labels available for training. However, this has come at the cost of increased model complexity and sophisticated multi-stage training procedures. This is in contrast to earlier work that used only a single stage $-$ training one segmentation network on image labels $-$ which was abandoned due to inferior segmentation accuracy. In this work, we first define three desirable properties of a weakly supervised method: local consistency, semantic fidelity, and completeness. Using these properties as guidelines, we then develop a segmentation-based network model and a self-supervised training scheme to train for semantic masks from image-level annotations in a single stage. We show that despite its simplicity, our method achieves results that are competitive with significantly more complex pipelines, substantially outperforming earlier single-stage methods.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Semantic segmentation | PASCAL VOC 2012 (val) | Mean IoU65.7 | 2040 | |
| Semantic segmentation | PASCAL VOC 2012 (test) | mIoU66.6 | 1342 | |
| Semantic segmentation | PASCAL VOC (val) | mIoU62.7 | 338 | |
| Semantic segmentation | Cityscapes (val) | mIoU11.8 | 287 | |
| Semantic segmentation | Pascal VOC (test) | mIoU64.3 | 236 | |
| Weakly supervised semantic segmentation | PASCAL VOC 2012 (test) | mIoU64.3 | 158 | |
| Weakly supervised semantic segmentation | PASCAL VOC 2012 (val) | mIoU65.3 | 154 | |
| Semantic segmentation | PASCAL VOC 2012 (train) | mIoU66.9 | 73 | |
| Weakly supervised semantic segmentation | PASCAL VOC 2012 (train) | mIoU (Mask)66.9 | 53 | |
| Incremental Semantic Segmentation | Pascal VOC disjoint setup 2012 (VOC 10-1) | mIoU (0-10)60.7 | 30 |