TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

About

Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained to capture global features that distinguish different text descriptions under contrastive supervision, making it highly effective for single-label classification. However, it performs poorly on multi-label datasets because the global feature tends to be dominated by the most prominent class, and the contrastive nature of the softmax operation aggravates this. In this study, we observe that multi-label classification results rely heavily on discriminative local features, which CLIP overlooks. We therefore dissect how CLIP preserves patch-wise spatial information and propose a local-to-global framework to obtain image tags. It comprises three steps: (1) patch-level classification to obtain coarse scores; (2) a dual-masking attention refinement (DMAR) module to refine the coarse scores; (3) a class-wise reidentification (CWR) module to remedy predictions from a global perspective. This framework is based solely on frozen CLIP and significantly enhances its multi-label classification performance on various benchmarks without dataset-specific training. In addition, to comprehensively assess the quality and practicality of the generated tags, we apply them to a downstream task, namely weakly supervised semantic segmentation (WSSS), with the generated tags as image-level pseudo labels. Experiments demonstrate that this classify-then-segment paradigm dramatically outperforms other annotation-free segmentation methods and validates the effectiveness of the generated tags. Our code is available at https://github.com/linyq2117/TagCLIP.
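
As a rough illustration of step (1) and of why a single global feature can miss secondary classes, the following is a minimal NumPy sketch. It is not the authors' implementation: the function names, the temperature, the threshold, the max-over-patches aggregation, and the toy features are all assumptions made for this example, and the DMAR and CWR modules are omitted because the abstract does not describe their internals.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coarse_patch_scores(patch_feats, text_feats, temperature=0.01):
    # patch_feats: (N, D), text_feats: (C, D) -> (N, C) class probabilities per patch.
    # The temperature value is an assumption for this toy example.
    logits = l2_normalize(patch_feats) @ l2_normalize(text_feats).T / temperature
    return softmax(logits, axis=-1)

def image_tags(patch_scores, threshold=0.5):
    # Assumed aggregation: a class is tagged if any patch scores it above the threshold.
    class_scores = patch_scores.max(axis=0)
    return class_scores, np.flatnonzero(class_scores >= threshold)

# Toy data (hypothetical): 3 orthogonal class embeddings in 8-D,
# 5 patches resembling class 0 and 1 patch resembling class 2.
rng = np.random.default_rng(0)
text_feats = np.eye(3, 8)
patch_feats = np.vstack([np.tile(text_feats[0], (5, 1)), text_feats[2][None]])
patch_feats = patch_feats + 0.01 * rng.standard_normal(patch_feats.shape)

# Local view: classify each patch, then aggregate per class.
patch_scores = coarse_patch_scores(patch_feats, text_feats)
class_scores, tags = image_tags(patch_scores)

# Global view: softmax over the similarity of one averaged feature.
global_feat = l2_normalize(patch_feats.mean(axis=0))
global_scores = softmax(global_feat @ l2_normalize(text_feats).T / 0.01)

print("patch-level tags:", tags)
print("global scores:", np.round(global_scores, 3))
```

In this toy setup the patch-level path tags both class 0 and class 2, while the softmax over the averaged feature puts nearly all of its mass on class 0, which mirrors the dominance effect described above.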

Yuqi Lin, Minghao Chen, Kaipeng Zhang, Hengjia Li, Mingming Li, Zheng Yang, Dongqin Lv, Binbin Lin, Haifeng Liu, Deng Cai • 2023

Related benchmarks

Task	Dataset	Result
Semantic segmentation	COCO Stuff	mIoU 3.10e+3	421
Semantic segmentation	COCO	mIoU 35.3	119
Multi-label recognition	MS-COCO	mAP 70	87
Multi-label recognition	NUS-WIDE	mAP 0.387	66
Multi-label recognition	PASCAL VOC 2012	mAP 90.8	65
Semantic segmentation	VOC	mIoU 68.7	64
Multi-Label Classification	COCO 2014	mAP 68.8	55
Multi-Label Classification	VOC 2007	mAP (Average) 92.8	52
Multi-label image recognition	MS-COCO 2014 (val)	mAP 73.61	51
Multi-label Image Classification	PASCAL VOC 2007	mAP 92.8	40

Showing 10 of 15 rows
