
T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting

About

Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at https://github.com/cha15yq/T2ICount.

Yifei Qian, Zhongliang Guo, Bowen Deng, Chun Tong Lei, Shuai Zhao, Chun Pong Lau, Xiaopeng Hong, Michael P. Pound • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Counting | FSC-147 (test) | MAE 11.76 | 297 |
| Crowd Counting | ShanghaiTech Part A (test) | MAE 222.2 | 227 |
| Object Counting | FSC-147 (val) | MAE 13.78 | 211 |
| Crowd Counting | ShanghaiTech Part B (test) | MAE 48.9 | 191 |
| Car Object Counting | CARPK (test) | MAE 8.61 | 116 |
| Counting | CARPK | MAE 8.61 | 41 |
| Object Counting | PASCAL VOC Count 2007 (test) | mRMSE 23.5 | 40 |
| Cell Counting | MBM (test) | MAE 30.01 | 14 |
| Cell Counting | VGG (test) | MAE 151.3 | 14 |
| Object Counting | FSC-147-S (test) | MAE 4.69 | 6 |
Showing 10 of 16 rows
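
For readers unfamiliar with the metric, the MAE figures in the benchmark table are, under the usual definition for counting tasks, the mean absolute difference between predicted and ground-truth counts over the evaluation images. The sketch below illustrates that standard definition, together with RMSE, on made-up counts. The function names and the sample numbers are hypothetical and are not taken from the T2ICount code or from any benchmark.

```python
import math


def mean_absolute_error(predicted, ground_truth):
    """Average of |predicted - ground_truth| over all images."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - g) for p, g in zip(predicted, ground_truth)) / len(predicted)


def root_mean_squared_error(predicted, ground_truth):
    """Square root of the average squared count error over all images."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - g) ** 2 for p, g in zip(predicted, ground_truth))
    return math.sqrt(squared / len(predicted))


# Hypothetical per-image counts, for illustration only.
predicted_counts = [10, 20, 30]
true_counts = [12, 18, 33]

mae = mean_absolute_error(predicted_counts, true_counts)
rmse = root_mean_squared_error(predicted_counts, true_counts)
print(f"MAE={mae:.4f} RMSE={rmse:.4f}")
```

Lower is better for both metrics. RMSE penalizes large per-image errors more heavily than MAE does.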
