SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation

About

In recent years, Contrastive Language-Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) tasks due to its powerful cross-modal semantic understanding capabilities. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches: over-activation in non-target foreground regions and background areas. Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment across modalities, reducing inter-class overlap while enhancing semantic correlations, to rectify over-activation in non-target foreground regions effectively; at the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely filter out interference from non-target regions during affinity propagation, significantly rectifying background over-activation. Extensive experiments on the PASCAL VOC and MS COCO datasets demonstrate that our method outperforms all single-stage approaches, as well as more complex multi-stage approaches, achieving mIoU scores of 79.5% and 50.6%, respectively.
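The contrastive alignment idea behind CMPA can be illustrated with a generic InfoNCE-style loss between per-class visual prototypes and text embeddings. This is a minimal sketch under stated assumptions, not the paper's actual loss: the function names, the cosine similarity, and the temperature `tau=0.07` are hypothetical choices for illustration, and the abstract does not specify how prototypes are built.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prototype_alignment_loss(visual_protos, text_embeds, tau=0.07):
    """Generic InfoNCE loss (an assumption, not the paper's formulation).

    visual_protos[c] and text_embeds[c] are assumed to describe the same
    class c. Each visual prototype is pulled toward its own class text
    embedding and pushed away from the text embeddings of other classes.
    """
    losses = []
    for c, proto in enumerate(visual_protos):
        logits = [cosine(proto, text) / tau for text in text_embeds]
        # Numerically stable log-sum-exp over all classes.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the matching class.
        losses.append(log_denominator - logits[c])
    return sum(losses) / len(losses)
```

With this sketch, prototypes that already match their class text embeddings give a loss near zero, while prototypes that match the wrong class give a large loss, which is the behaviour that reduces inter-class overlap.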

Xiuli Bi, Die Xiao, Junchao Fan, Bin Xiao • 2025

Related benchmarks

Task	Dataset	Result
Semantic segmentation	PASCAL VOC 2012 (val)	mIoU 79.5	2210
Semantic segmentation	PASCAL VOC 2012 (test)	mIoU 79.6	1485
Semantic segmentation	COCO 2014 (val)	mIoU 50.6	304
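For reference, the mIoU metric reported above averages the per-class intersection over union. The following is a minimal sketch, assuming flattened label sequences and an ignore label of 255 (the usual PASCAL VOC convention); the function name and signature are illustrative, and benchmark implementations typically accumulate counts over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that occur in pred or target.

    pred and target are flat sequences of integer class labels of equal
    length. Pixels whose target equals ignore_index are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Correct pixel: counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

Multiplying the returned value by 100 gives the percentage form used in the table.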
