LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation
About
We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods, across a diverse set of datasets.

Code: https://github.com/vladan-stojnic/LPOSS
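To make the patch-level idea concrete, the following is a minimal sketch of generic graph-based label propagation, not the authors' implementation (see the repository linked above for that). It assumes a symmetric k-nearest-neighbour affinity graph built from cosine similarity of VM patch features, and the standard closed-form solution `(I - alpha * S)^-1 * Y0` with a symmetrically normalized affinity matrix `S`. The function names, `k`, `alpha`, and the toy features and scores are all hypothetical choices for illustration. The pixel-level refinement step is not covered here.

```python
import numpy as np

def knn_affinity(feats, k=2):
    """Symmetric kNN affinity from cosine similarity of patch features.
    (Assumed graph construction; the paper's exact choice may differ.)"""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = np.clip(f @ f.T, 0.0, None)   # keep non-negative similarities
    np.fill_diagonal(sim, 0.0)          # no self-loops
    idx = np.argsort(-sim, axis=1)[:, :k]  # k most similar patches per row
    rows = np.arange(sim.shape[0])[:, None]
    w = np.zeros_like(sim)
    w[rows, idx] = sim[rows, idx]
    return np.maximum(w, w.T)           # symmetrize

def label_propagation(w, y0, alpha=0.9):
    """Solve (I - alpha * S) Y = Y0 with S = D^-1/2 W D^-1/2."""
    d = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    s = w * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.linalg.solve(np.eye(w.shape[0]) - alpha * s, y0)

# Toy data: 6 patches, 2 classes. VM features form two tight groups.
vm_feats = np.array([[1.0, 0.1], [1.0, 0.0], [0.9, 0.1],
                     [0.1, 1.0], [0.0, 1.0], [0.1, 0.9]])
# Initial per-patch VLM class scores; patch 1 disagrees with its group.
vlm_scores = np.array([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2],
                       [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])

initial = vlm_scores.argmax(axis=1)
refined = label_propagation(knn_affinity(vm_feats), vlm_scores).argmax(axis=1)
print("initial:", initial.tolist())
print("refined:", refined.tolist())
```

In this toy setup, patch 1 starts with a prediction that disagrees with its two most similar patches, and propagation over the VM-based graph pulls it toward its neighbours' class. The dense solve is only practical for small graphs; a real image has many more patches and would likely call for a sparse or iterative solver.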
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Open Vocabulary Semantic Segmentation | COCOStuff (val) | mIoU | 25.9 | 60 |
| Open Vocabulary Semantic Segmentation | Cityscapes (val) | mIoU | 37.3 | 37 |
| Open Vocabulary Semantic Segmentation | PASCAL Context 59 (val) | mIoU | 37.8 | 32 |
| Open-Vocabulary Segmentation | Pascal VOC 21 2012 (val) | mIoU | 61.1 | 27 |
| Open-Vocabulary Segmentation | Pascal Context 60 (val) | mIoU | 34.6 | 26 |
| Open-Vocabulary Segmentation | ADE20K (ADE) (val) | mIoU | 21.8 | 25 |
| Open-Vocabulary Segmentation | COCO-Object (COCO-O) (val) | mIoU | 33.4 | 25 |
| Open-Vocabulary Segmentation | Pascal VOC 20 2012 (val) | mIoU | 78.8 | 23 |
| Open-Vocabulary Segmentation | Natural-scene (NS) benchmark suite V21, PC60, COCO-O, V20, PC59, COCO-S, City, ADE | V21 mIoU (with background) | 61.1 | 18 |