Decoupling Zero-Shot Semantic Segmentation
About
Zero-shot semantic segmentation (ZS3) aims to segment novel categories that have not been seen during training. Existing works formulate ZS3 as a pixel-level zero-shot classification problem and transfer semantic knowledge from seen classes to unseen ones with the help of language models pre-trained only on text. While simple, the pixel-level ZS3 formulation has limited capability to integrate vision-language models, which are often pre-trained on image-text pairs and currently demonstrate great potential for vision tasks. Inspired by the observation that humans often perform segment-level semantic labeling, we propose to decouple ZS3 into two sub-tasks: 1) a class-agnostic grouping task that groups pixels into segments, and 2) a zero-shot classification task on segments. The former task does not involve category information and can be directly transferred to group pixels for unseen classes. The latter task operates at the segment level and provides a natural way to leverage large-scale vision-language models pre-trained on image-text pairs (e.g., CLIP) for ZS3. Based on the decoupling formulation, we propose a simple and effective zero-shot semantic segmentation model, called ZegFormer, which outperforms previous methods on standard ZS3 benchmarks by large margins, e.g., 22 points on PASCAL VOC and 3 points on COCO-Stuff in terms of mIoU for unseen classes. Code will be released at https://github.com/dingjiansw101/ZegFormer.
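The decoupled formulation can be illustrated with a toy sketch. This is not the ZegFormer implementation: the function names (`classify_segments`, `label_pixels`), the two-dimensional embeddings, and the segment map are all hypothetical values chosen for illustration. It assumes the grouping step has already produced a map of segment ids and one embedding per segment, and that each class name has a text embedding in the same space (as a CLIP-like model would provide).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_segments(segment_embeddings, class_embeddings):
    """Sub-task 2: give each segment the class whose embedding is most similar.

    The class set can contain names never seen when the grouping model was
    trained, because only their text embeddings are needed here.
    """
    labels = []
    for seg in segment_embeddings:
        best = max(class_embeddings, key=lambda name: cosine(seg, class_embeddings[name]))
        labels.append(best)
    return labels


def label_pixels(segment_map, segment_labels):
    """Combine both sub-tasks: map each pixel's segment id to a class name."""
    return [[segment_labels[seg_id] for seg_id in row] for row in segment_map]


# Sub-task 1 output (assumed given): a 2x3 image grouped into segments 0 and 1.
segment_map = [[0, 0, 1],
               [0, 1, 1]]
# Toy embeddings; a real system would obtain these from image and text encoders.
segment_embeddings = [[0.9, 0.1], [0.2, 0.8]]
class_embeddings = {"cat": [1.0, 0.0], "zebra": [0.0, 1.0]}

segment_labels = classify_segments(segment_embeddings, class_embeddings)
pixel_labels = label_pixels(segment_map, segment_labels)
```

Note that the grouping output carries no category information, so adding a new class only requires adding one entry to `class_embeddings`.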
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 16.4 | 2731 |
| Semantic segmentation | PASCAL VOC 2012 (val) | -- | 2040 |
| Semantic segmentation | PASCAL VOC 2012 (test) | mIoU 85.6 | 1342 |
| Semantic segmentation | PASCAL VOC (val) | mIoU 89.5 | 338 |
| Semantic segmentation | PASCAL Context (val) | mIoU 30.7 | 323 |
| Semantic segmentation | Pascal VOC (test) | mIoU 90.1 | 236 |
| Semantic segmentation | COCO Stuff | mIoU 34.8 | 195 |
| Semantic segmentation | ADE20K A-150 | mIoU 18 | 188 |
| Semantic segmentation | Coco-Stuff (test) | -- | 184 |
| Semantic segmentation | Pascal Context (test) | mIoU 36.1 | 176 |