High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation
About
Open-vocabulary image segmentation has been advanced through the synergy between mask generators and vision-language models like Contrastive Language-Image Pre-training (CLIP). Previous approaches focus on generating masks while aligning mask features with text embeddings during training. In this paper, we observe that relying on generated low-quality masks can weaken the alignment of vision and language in regional representations. This motivates us to present a new fine-tuning framework, named MaskCLIP++, which uses ground-truth masks instead of generated masks to enhance the mask classification capability of CLIP. Because image segmentation datasets with mask annotations have limited diversity, we propose incorporating a consistency alignment principle during fine-tuning, which alleviates categorical bias toward the fine-tuning dataset. After low-cost fine-tuning, MaskCLIP++ significantly improves mask classification performance on multi-domain datasets. When combined with the mask generators of previous state-of-the-art mask-based open-vocabulary segmentation methods, we achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets, respectively. Code is available at https://github.com/HVision-NKU/MaskCLIPpp.
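To make "mask classification" concrete, the sketch below shows one common way to classify a region with a CLIP-style model: average the dense image features inside each mask, then compare the pooled vector with the text embeddings of the candidate class names by cosine similarity. It is an illustration under stated assumptions, not the MaskCLIP++ implementation: the function names `mask_pool` and `classify_masks`, the `temperature` value, and the use of plain NumPy arrays in place of real CLIP features are all hypothetical, and the consistency alignment used during fine-tuning is not modeled here.

```python
import numpy as np


def mask_pool(features, masks, eps=1e-6):
    """Average per-pixel features inside each binary mask.

    features: (H, W, C) dense image features (stand-in for CLIP features).
    masks:    (N, H, W) binary masks, e.g. ground-truth masks.
    returns:  (N, C) one pooled feature vector per mask.
    """
    masks = masks.astype(features.dtype)
    area = masks.sum(axis=(1, 2))[:, None]              # (N, 1) pixels per mask
    pooled = np.einsum("nhw,hwc->nc", masks, features)  # sum of features per mask
    return pooled / np.maximum(area, eps)


def classify_masks(features, masks, text_embeddings, temperature=0.01, eps=1e-6):
    """Assign each mask the class whose text embedding is most similar.

    text_embeddings: (K, C), one embedding per class name.
    returns: (labels, probs) with shapes (N,) and (N, K).
    """
    pooled = mask_pool(features, masks)
    pooled = pooled / np.maximum(np.linalg.norm(pooled, axis=1, keepdims=True), eps)
    text = text_embeddings / np.maximum(
        np.linalg.norm(text_embeddings, axis=1, keepdims=True), eps
    )
    logits = pooled @ text.T / temperature               # scaled cosine similarity
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs


# Toy usage: a 4x4 "image" whose left half looks like class 0 and right half like class 1.
feats = np.zeros((4, 4, 3))
feats[:, :2] = [1.0, 0.0, 0.0]
feats[:, 2:] = [0.0, 1.0, 0.0]
gt_masks = np.zeros((2, 4, 4), dtype=bool)
gt_masks[0, :, :2] = True
gt_masks[1, :, 2:] = True
class_text = np.eye(3)  # three hypothetical class embeddings
labels, probs = classify_masks(feats, gt_masks, class_text)
```

In this framing, the quality of the mask directly determines which pixels contribute to the pooled vector, which is one way to read the abstract's point that low-quality generated masks can degrade region-text alignment.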
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | PC-459 | mIoU | 23.9 | 43 |
| Semantic segmentation | PC-59 | mIoU | 62.6 | 38 |
| Panoptic Segmentation | ADE20K 150 59 (val) | Panoptic Quality (PQ) | 28.1 | 35 |
| Instance Segmentation | ADE20K 150 59 (val) | AP | 17.3 | 30 |
| Semantic segmentation | A-847 | mIoU | 16.8 | 14 |
| Semantic segmentation | A-150 | mIoU | 38.2 | 13 |
| Event Instance Segmentation | DSEC Detection | AP | 16.3 | 12 |
| Semantic segmentation | PAS-20 | mIoU | 96.8 | 9 |
| Mask Classification | ADE20K 847 | Mask Accuracy | 0.384 | 5 |
| Mask Classification | Pascal Context 459 | Mask Accuracy | 56.4 | 5 |
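
Most rows above report mIoU (mean intersection over union). As a reference for how that number is typically obtained, here is a minimal sketch that builds a confusion matrix from predicted and ground-truth label maps and averages per-class IoU over the classes that occur. The function name `mean_iou` and the `ignore_index=255` convention are assumptions for illustration; benchmark toolkits usually accumulate the confusion matrix over the whole validation set before averaging, and their exact handling of absent classes may differ.

```python
import numpy as np


def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, gt: integer label maps of the same shape.
    Pixels whose ground-truth label equals ignore_index are skipped.
    """
    valid = gt != ignore_index
    pred = pred[valid].astype(np.int64)
    gt = gt[valid].astype(np.int64)
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(gt * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0
    return float((inter[present] / union[present]).mean())


# Toy usage: class 0 has IoU 1/2 and class 1 has IoU 2/3.
pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, gt, num_classes=2)
```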