DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

About

Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been explored. In this work, we present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using the contextual information from the image to prompt the language model, we enable our model to better exploit the pre-trained knowledge. Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones, including both CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate the superior performance of our methods on semantic segmentation, object detection, and instance segmentation tasks. Code is available at https://github.com/raoyongming/DenseCLIP
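To make the pixel-text matching idea concrete, here is a minimal sketch of how a pixel-text score map could be computed. It assumes cosine similarity between L2-normalized per-pixel features and per-class text features, mirroring how CLIP scores image-text pairs. The function names, shapes, and random inputs are hypothetical and are not taken from the DenseCLIP code; the paper's exact formulation may differ.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def pixel_text_score_map(pixel_feats, text_feats):
    """Cosine similarity between every pixel feature and every class text feature.

    pixel_feats: (H, W, C) array, one C-dim embedding per spatial location.
    text_feats:  (K, C) array, one C-dim embedding per class prompt.
    Returns an (H, W, K) score map.
    """
    z = l2_normalize(pixel_feats)
    t = l2_normalize(text_feats)
    # (H, W, C) @ (C, K) -> (H, W, K)
    return z @ t.T


# Hypothetical sizes and random features, used only to illustrate the shapes.
rng = np.random.default_rng(0)
H, W, C, K = 4, 5, 8, 3
pixel_feats = rng.normal(size=(H, W, C))
text_feats = rng.normal(size=(K, C))

scores = pixel_text_score_map(pixel_feats, text_feats)  # (H, W, K)
pred = scores.argmax(axis=-1)                           # (H, W) class index per pixel
```

In this sketch the score map acts as a coarse per-pixel classification. How it is used to guide a downstream dense prediction model (for example as an extra input or an auxiliary supervision target) is a design choice that the abstract does not specify.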

Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu • 2021

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 49.8	3069
Object Detection	COCO 2017 (val)	AP 40.2	2843
Semantic segmentation	ADE20K	mIoU 50.6	559
Semantic segmentation	Cityscapes	mIoU 81	494
Medical Image Segmentation	BUSI	Dice Score 71.85	134
Medical Image Segmentation	CVC-ClinicDB	Dice Score 68.08	118
Medical Image Segmentation	Medical Image Segmentation Aggregate (Average of BUSI, BTMRI, ISIC, Kvasir-SEG, QaTa-COV19, and EUS) (test)	DSC 74.19	80
Medical Image Segmentation	ISIC	DICE 89.29	79
Medical Image Segmentation	BTMRI (Source)	DSC 70.3	24
Semantic segmentation	GTA to UAVID	Road IoU 25.5	15

Showing 10 of 23 rows
