T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting
About
Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at https://github.com/cha15yq/T2ICount.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Counting | FSC-147 (test) | MAE 11.76 | 297 |
| Crowd Counting | ShanghaiTech Part A (test) | MAE 222.2 | 227 |
| Object Counting | FSC-147 (val) | MAE 13.78 | 211 |
| Crowd Counting | ShanghaiTech Part B (test) | MAE 48.9 | 191 |
| Car Object Counting | CARPK (test) | MAE 8.61 | 116 |
| Counting | CARPK | MAE 8.61 | 41 |
| Object Counting | PASCAL VOC Count 2007 (test) | mRMSE 23.5 | 40 |
| Cell Counting | MBM (test) | MAE 30.01 | 14 |
| Cell Counting | VGG (test) | MAE 151.3 | 14 |
| Object Counting | FSC-147-S (test) | MAE 4.69 | 6 |
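
Most results in the table are reported as mean absolute error (MAE) between predicted and ground-truth counts. The following is a minimal sketch of how MAE, and the closely related root mean squared error (RMSE), are commonly computed, assuming one predicted count and one ground-truth count per image. The function names `mae` and `rmse` and the sample numbers are illustrative only; they are not taken from the T2ICount repository or from any benchmark.

```python
import math


def _check(pred, gt):
    # Both lists must be non-empty and aligned image by image.
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")


def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth counts."""
    _check(pred, gt)
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth counts."""
    _check(pred, gt)
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred))


# Hypothetical per-image counts, for illustration only.
predicted = [10, 22, 5]
ground_truth = [12, 20, 5]

print(f"MAE:  {mae(predicted, ground_truth):.3f}")
print(f"RMSE: {rmse(predicted, ground_truth):.3f}")
```

Lower is better for both metrics. RMSE penalizes large per-image errors more heavily than MAE, so the two can rank methods differently on images with very high object counts.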