Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation

About

Vision-language pre-training has demonstrated remarkable zero-shot recognition ability and the potential to learn generalizable visual representations from language supervision. Going a step further, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, state-of-the-art methods suffer from clear semantic gaps between the visual and textual modalities: many visual concepts that appear in images are missing from their paired captions. Such semantic misalignment propagates through pre-training, leading to inferior zero-shot performance in dense prediction because the textual representations capture too few visual concepts. To close this semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually matched concepts using our proposed vision-driven expansion and text-to-vision-guided ranking. Relevant concepts can then be identified via cluster-guided sampling and fed into pre-training, thereby bridging the gap between visual and textual semantics. Extensive experiments over a broad suite of 8 segmentation benchmarks show that CoCu achieves strong zero-shot transfer performance and improves the language-supervised segmentation baseline by a large margin, suggesting the value of bridging the semantic gap in pre-training data.
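As a rough illustration of the concept-archive idea, the sketch below expands a caption's concepts with those of visually similar images and then ranks the candidates against the image. It is a toy under stated assumptions, not the authors' implementation: the function names, the hand-written 3-dimensional vectors standing in for CLIP embeddings, and the cosine-similarity scoring rule are all hypothetical, and the cluster-guided sampling step is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_concepts(image_vec, caption_concepts, neighbours, k=2):
    """Toy vision-driven expansion (assumed form): borrow concepts from
    the captions of the k images most similar to the query image."""
    ranked = sorted(
        neighbours,
        key=lambda n: cosine(image_vec, n["image_vec"]),
        reverse=True,
    )
    archive = set(caption_concepts)
    for n in ranked[:k]:
        archive.update(n["concepts"])
    return archive


def rank_concepts(image_vec, archive, concept_vecs):
    """Toy ranking (assumed form): order candidate concepts by similarity
    between their embedding and the image embedding, ties broken by name."""
    return sorted(archive, key=lambda c: (-cosine(image_vec, concept_vecs[c]), c))


# Hypothetical stand-ins for CLIP image and text embeddings.
image_vec = [1.0, 0.0, 0.2]
caption_concepts = ["dog"]  # the caption only mentions the dog
neighbours = [
    {"image_vec": [0.9, 0.1, 0.3], "concepts": ["dog", "grass"]},
    {"image_vec": [0.0, 1.0, 0.0], "concepts": ["car"]},
    {"image_vec": [0.8, 0.0, 0.5], "concepts": ["frisbee"]},
]
concept_vecs = {
    "dog": [1.0, 0.0, 0.0],
    "grass": [0.6, 0.0, 0.8],
    "frisbee": [0.7, 0.1, 0.6],
    "car": [0.0, 1.0, 0.1],
}

archive = expand_concepts(image_vec, caption_concepts, neighbours, k=2)
ranked = rank_concepts(image_vec, archive, concept_vecs)
print(ranked)
```

In this toy setup the archive picks up "grass" and "frisbee" from the two nearest images while the unrelated "car" is left out, which mirrors the stated goal of recovering visual concepts that the original caption omits.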

Yun Xing, Jian Kang, Aoran Xiao, Jiahao Nie, Ling Shao, Shijian Lu • 2023

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 12.3	3069
Semantic segmentation	PASCAL VOC 2012 (val)	Mean IoU 51.4	2204
Semantic segmentation	ADE20K	mIoU 12.3	1028
Semantic segmentation	Cityscapes	mIoU 22.1	668
Semantic segmentation	Cityscapes (val)	mIoU 15	572
Semantic segmentation	COCO Stuff	mIoU 22.1	399
Semantic segmentation	PASCAL Context (val)	mIoU 23.6	360
Semantic segmentation	Pascal VOC	mIoU 0.514	280
Semantic segmentation	ADE20K A-150	mIoU 11.1	224
Semantic segmentation	Pascal Context 59	mIoU 23.6	204

Showing 10 of 28 rows
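
The results above are reported as mIoU (mean intersection over union). For readers unfamiliar with the metric, here is a minimal sketch of how it is commonly computed from flat lists of predicted and ground-truth class labels. The function name, the `ignore_index` default of 255, and the choice to average only over classes that occur are assumptions for illustration; individual benchmarks may accumulate statistics differently.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or ground truth.

    pred and target are equal-length sequences of integer class labels.
    Pixels whose ground-truth label equals ignore_index are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # A correct pixel counts once toward both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of the true class and,
            # if valid, of the predicted class.
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Six "pixels", three classes, last pixel ignored.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
print(round(100 * score, 1))  # reported as a percentage
```

Here the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18, or about 72.2 when expressed as a percentage.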
