Per-Pixel Classification is Not All You Need for Semantic Segmentation
About
Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
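To make the mask classification formulation concrete, here is a minimal sketch of how a set of binary (soft) masks, each paired with one global class distribution, could be turned into a per-pixel semantic map. The combination rule used here (weight each mask by its class probability, sum over masks, take the per-pixel argmax) is an assumption for illustration, since the abstract does not spell out the inference step. The function name `semantic_map_from_masks` and the toy inputs are hypothetical.

```python
def semantic_map_from_masks(class_probs, mask_probs):
    """Combine N (class distribution, soft mask) pairs into a per-pixel label map.

    class_probs: N lists of K class probabilities (one global distribution per mask).
    mask_probs:  N masks, each an H x W grid of per-pixel membership probabilities.
    Returns an H x W grid of class indices.
    """
    num_classes = len(class_probs[0])
    height, width = len(mask_probs[0]), len(mask_probs[0][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Assumed rule: score of class c at this pixel is the sum over
            # masks of p_i(c) * m_i[y][x].
            scores = [
                sum(p[c] * m[y][x] for p, m in zip(class_probs, mask_probs))
                for c in range(num_classes)
            ]
            # Pick the class with the highest combined score.
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels


# Toy example: 2 predicted masks, 2 classes, a 1 x 2 image.
class_probs = [[0.9, 0.1], [0.2, 0.8]]
mask_probs = [[[0.9, 0.1]], [[0.1, 0.9]]]
labels = semantic_map_from_masks(class_probs, mask_probs)
```

In this toy case the first mask is confident about class 0 and covers the left pixel, while the second is confident about class 1 and covers the right pixel. Because each class decision is made once per mask rather than once per pixel, the same pair of outputs can in principle serve both semantic and instance-level tasks.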
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 55.6 | 2731 |
| Instance segmentation | COCO 2017 (val) | -- | 1144 |
| Semantic segmentation | ADE20K | mIoU 46.7 | 936 |
| Semantic segmentation | Cityscapes | mIoU 78.5 | 578 |
| Semantic segmentation | Cityscapes (val) | mIoU 79.7 | 572 |
| Semantic segmentation | Cityscapes (val) | mIoU 80.3 | 332 |
| Semantic segmentation | PASCAL Context (val) | mIoU 53.7 | 323 |
| Panoptic segmentation | COCO (val) | PQ 52.7 | 219 |
| Semantic segmentation | COCO-Stuff | mIoU 41.9 | 195 |
| Semantic segmentation | COCO-Stuff (test) | mIoU 33.8 | 184 |