Unified Perceptual Parsing for Scene Understanding
About
Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect the objects inside them, while also identifying the textures and surfaces of those objects along with their compositional parts. In this paper, we study a new task called Unified Perceptual Parsing, which requires machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We benchmark our framework on Unified Perceptual Parsing and show that it effectively segments a wide range of concepts from images. The trained networks are further applied to discover visual knowledge in natural scenes. Models are available at https://github.com/CSAILVision/unifiedparsing.
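To make the multi-task idea concrete, below is a minimal PyTorch sketch of a shared encoder with one prediction head per perceptual level (scene, object, material), where each head only receives a loss when a batch's data source actually carries that annotation. This is an illustrative sketch under stated assumptions, not the authors' implementation: the backbone, head names, channel widths, and class counts are all placeholders, and the official code lives at the repository linked above.

```python
# Illustrative sketch only -- not the official UPerNet code (see the repo
# above). Backbone, channel widths, head names, and class counts are
# assumptions chosen to keep the example small.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadParser(nn.Module):
    """Shared encoder feeding one prediction head per perceptual level."""

    def __init__(self, num_scenes=365, num_objects=150, num_materials=26):
        super().__init__()
        # Tiny stand-in encoder; the paper uses a much deeper backbone
        # with feature-pyramid-style fusion.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Scene is an image-level label: global pooling + linear classifier.
        self.scene_head = nn.Linear(128, num_scenes)
        # Object and material are pixel-level labels: 1x1 conv classifiers.
        self.object_head = nn.Conv2d(128, num_objects, 1)
        self.material_head = nn.Conv2d(128, num_materials, 1)

    def forward(self, x):
        feats = self.encoder(x)
        size = x.shape[2:]
        return {
            "scene": self.scene_head(feats.mean(dim=(2, 3))),
            "object": F.interpolate(self.object_head(feats), size,
                                    mode="bilinear", align_corners=False),
            "material": F.interpolate(self.material_head(feats), size,
                                      mode="bilinear", align_corners=False),
        }


def heterogeneous_loss(outputs, targets):
    """Sum losses only over the tasks a batch is annotated with, so data
    sources with different label types can be mixed during training."""
    loss = outputs["scene"].sum() * 0.0  # zero tensor on the right device
    if "scene" in targets:                       # image-level labels
        loss = loss + F.cross_entropy(outputs["scene"], targets["scene"])
    for task in ("object", "material"):          # pixel-level labels
        if task in targets:
            loss = loss + F.cross_entropy(outputs[task], targets[task])
    return loss


if __name__ == "__main__":
    model = MultiHeadParser()
    out = model(torch.randn(2, 3, 64, 64))
    # A batch from an object-segmentation source: only object labels exist,
    # so only the object head contributes to the loss.
    targets = {"object": torch.randint(0, 150, (2, 64, 64))}
    print(heterogeneous_loss(out, targets).item())
```

The per-source loss gating is the part that mirrors the training strategy described above: each head is updated only on images whose annotation source covers that concept.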
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 57 | 2731 |
| Semantic segmentation | ADE20K | mIoU 56.3 | 936 |
| Semantic segmentation | Cityscapes (val) | -- | 572 |
| Semantic segmentation | Cityscapes (val) | mIoU 81.1 | 332 |
| Semantic segmentation | COCO-Stuff (test) | mIoU 43.4 | 184 |
| Semantic segmentation | LoveDA | mIoU 52.44 | 142 |
| Semantic segmentation | PASCAL Context | -- | 111 |
| Video semantic segmentation | VSPW (val) | mIoU 37.5 | 92 |
| Semantic segmentation | LoveDA (test) | mIoU 47.6 | 81 |
| Semantic segmentation | ADE20K v1 (val) | -- | 76 |