
Context Patch Fusion With Class Token Enhancement for Weakly Supervised Semantic Segmentation

About

Weakly Supervised Semantic Segmentation (WSSS), which relies only on image-level labels, has attracted significant attention for its cost-effectiveness and scalability. Existing methods mainly enhance inter-class distinctions and employ data augmentation to mitigate semantic ambiguity and reduce spurious activations. However, they often neglect the complex contextual dependencies among image patches, resulting in incomplete local representations and limited segmentation accuracy. To address these issues, we propose the Context Patch Fusion with Class Token Enhancement (CPF-CTE) framework, which exploits contextual relations among patches to enrich feature representations and improve segmentation. At its core, the Contextual-Fusion Bidirectional Long Short-Term Memory (CF-BiLSTM) module captures spatial dependencies between patches and enables bidirectional information flow, yielding a more comprehensive understanding of spatial correlations. This strengthens feature learning and segmentation robustness. Moreover, we introduce learnable class tokens that dynamically encode and refine class-specific semantics, enhancing discriminative capability. By effectively integrating spatial and semantic cues, CPF-CTE produces richer and more accurate representations of image content. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 validate that CPF-CTE consistently surpasses prior WSSS methods.

Yiyang Fu, Hui Li, Wangyu Wu • 2026

Related benchmarks

Task                            Dataset                  Result     Rank
Semantic segmentation           COCO 2014 (val)          mIoU 45.4  251
Semantic segmentation           PASCAL VOC 2012 (val)    mIoU 69.5  126
Pseudo Ground-Truth Generation  PASCAL VOC 2012 (train)  mIoU 70.8  19
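
The results above are reported as mIoU (mean Intersection over Union), given in percent. As a rough illustration of how that metric is typically computed, here is a minimal pure-Python sketch. It is not the evaluation code used for these benchmarks: the function name `mean_iou`, the flat label lists, and the `ignore_index=255` convention are assumptions made for the example. Benchmark evaluations also usually accumulate intersection and union counts over the whole dataset rather than averaging per image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes for two flat lists of integer labels.

    Assumptions (not taken from the source): labels are flattened per pixel,
    predictions lie in [0, num_classes), and target pixels equal to
    `ignore_index` are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel, excluded from the metric
        if p == t:
            # Pixel belongs to both the predicted and true region of class t.
            intersection[t] += 1
            union[t] += 1
        else:
            # Pixel belongs to the predicted region of p and the true region of t.
            union[p] += 1
            union[t] += 1

    # Average only over classes that appear in the prediction or the target.
    ious = [intersection[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    # Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2.
    print(f"mIoU = {100 * mean_iou(pred, target, num_classes=3):.1f}%")
```

Multiplying the returned fraction by 100 gives a value on the same scale as the table entries.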
