
VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting

About

Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To address ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars, then counting. However, this sequential two-stage design remains vulnerable to error propagation. In this work, a one-stage baseline, Visual-Language Baseline (VLBase), is proposed, which explores the implicit association between the semantic and patch embeddings of CLIP. VLBase is then extended to Visual-language Counter (VLCounter) by incorporating three modules devised to tailor VLBase for object counting. First, Semantic-conditioned Prompt Tuning (SPT) is introduced within the image encoder to acquire target-highlighted representations. Second, Learnable Affine Transformation (LAT) is employed to adapt the semantic-patch similarity map to the counting task. Lastly, the layer-wise encoded features are transferred to the decoder through Segment-aware Skip Connection (SaSC) to preserve the generalization capability for unseen classes. Extensive experiments on FSC147, CARPK, and PUCPR+ demonstrate the benefits of the end-to-end framework, VLCounter.
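To make the idea of a semantic-patch similarity map more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the similarity function or the form of LAT, so cosine similarity and a scalar `scale * s + shift` transform are assumptions made for illustration, and the function names and toy 2-D embeddings are hypothetical. In a trained model the scale and shift would be learned; here they are fixed constants.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_map(text_embedding, patch_embeddings):
    """Compare one text embedding with every patch embedding in a 2-D grid.

    Assumption: cosine similarity stands in for the semantic-patch
    association described in the abstract.
    """
    return [
        [cosine_similarity(text_embedding, patch) for patch in row]
        for row in patch_embeddings
    ]


def affine_transform(sim_map, scale, shift):
    """Apply scale * s + shift to every entry of the similarity map.

    Assumption: a scalar affine form; the parameters would be learned
    in practice and are plain constants here.
    """
    return [[scale * s + shift for s in row] for row in sim_map]


# Toy 2x2 grid of 2-D patch embeddings (hypothetical values).
text = [1.0, 0.0]
patches = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]

sim = similarity_map(text, patches)
adjusted = affine_transform(sim, scale=2.0, shift=0.5)
```

Patches whose embedding points in the same direction as the text embedding score close to 1, and the affine step rescales those scores into whatever range suits the downstream counting decoder.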

Seunggu Kang, WonJun Moon, Euiyeon Kim, Jae-Pil Heo • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Counting | FSC-147 (test) | MAE 17.05 | 297 |
| Object Counting | FSC-147 (val) | MAE 18.06 | 211 |
| Car Object Counting | CARPK (test) | MAE 6.46 | 116 |
| Counting | CARPK | MAE 6.46 | 41 |
| Object Counting | PASCAL VOC Count 2007 (test) | mRMSE 28.9 | 40 |
| Car Counting | PUCPR+ (test) | MAE 48.94 | 31 |
| Object Counting | PUCPR+ | MAE 48.94 | 6 |
| Object Counting | FSC-147-S (test) | MAE 35.24 | 6 |
| Object Counting | IOCfish5K (test) | MAE 78 | 5 |
| Object Counting | IOCfish5k | MAE 78 | 2 |
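Most results above are reported as MAE (mean absolute error), where lower is better. As a rough guide to reading these numbers, the sketch below computes MAE and the related RMSE over per-image counts. The function names and the sample counts are hypothetical, and the exact evaluation protocol of each benchmark is not stated on this page.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true counts."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def root_mean_squared_error(predicted, actual):
    """Square root of the average squared count error."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))


# Hypothetical per-image counts, for illustration only.
predicted_counts = [12.0, 30.5, 7.0, 101.0]
actual_counts = [10, 33, 7, 95]

mae = mean_absolute_error(predicted_counts, actual_counts)
rmse = root_mean_squared_error(predicted_counts, actual_counts)
```

Because RMSE squares each error before averaging, a single badly miscounted image raises it more than it raises MAE.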
