CLIP-Count: Towards Text-Guided Zero-Shot Object Counting
About
Recent advances in visual-language models have shown remarkable zero-shot text-image matching ability that is transferable to downstream tasks such as object detection and segmentation. Adapting these models for object counting, however, remains a formidable challenge. In this study, we first investigate transferring vision-language models (VLMs) for class-agnostic object counting. Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module to propagate semantic information across different resolution levels of visual features. Benefiting from the full exploitation of the rich image-text alignment knowledge of pretrained VLMs, our method effectively generates high-quality density maps for objects-of-interest. Extensive experiments on FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate state-of-the-art accuracy and generalizability of the proposed method. Code is available: https://github.com/songrise/CLIP-Count.
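The patch-text contrastive loss described above aligns dense patch features with the text embedding. As an illustration only, here is a minimal numpy sketch of an InfoNCE-style patch-text contrastive objective; the function name, the binary `pos_mask` (marking patches inside the object region), and the temperature value are assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def patch_text_contrastive_loss(patch_emb, text_emb, pos_mask, tau=0.07):
    """InfoNCE-style sketch: patches inside the object region (pos_mask=1)
    are pulled toward the text embedding, other patches act as negatives.

    patch_emb: (N, D) patch-level visual features
    text_emb:  (D,)  text embedding for the object class
    pos_mask:  (N,)  1.0 for positive (object) patches, 0.0 otherwise
    """
    # L2-normalize so dot products become cosine similarities
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    logits = (p @ t) / tau                 # (N,) temperature-scaled similarities
    logits = logits - logits.max()         # numerical stability before exp
    e = np.exp(logits)
    # -log of the probability mass assigned to positive patches
    return -np.log((pos_mask * e).sum() / e.sum())
```

Pushing the positive patches toward the text embedding (cosine similarity 1) increases their share of the softmax mass, so the loss decreases, which is the behavior the contrastive objective relies on.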
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Counting | FSC-147 (test) | MAE | 17.78 | 297 |
| Crowd Counting | ShanghaiTech Part A (test) | MAE | 192.6 | 227 |
| Object Counting | FSC-147 (val) | MAE | 18.76 | 211 |
| Crowd Counting | ShanghaiTech Part B (test) | MAE | 45.7 | 191 |
| Crowd Counting | ShanghaiTech Part B | MAE | 45.7 | 160 |
| Crowd Counting | ShanghaiTech Part A | MAE | 192.6 | 138 |
| Car Object Counting | CARPK (test) | MAE | 11.96 | 116 |
| Counting | CARPK | MAE | 11.7 | 41 |
| Object Counting | PASCAL VOC Count 2007 (test) | mRMSE | 32.7 | 40 |
| Crowd Counting | ShanghaiTech B 12 (test) | MAE | 45.7 | 10 |