YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

About

We propose YOLO-Count, a differentiable open-vocabulary object counting model that both tackles general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the 'cardinality' map, a novel regression target that accounts for variations in object size and spatial distribution. Leveraging representation alignment and a hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between open-vocabulary counting and T2I generation control. Its fully differentiable architecture supports gradient-based optimization, enabling accurate object count estimation and fine-grained guidance for generative models. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art counting accuracy while providing robust and effective quantity control for T2I systems.

Guanning Zeng, Xiang Zhang, Zirui Wang, Haiyang Xu, Zeyuan Chen, Bingnan Li, Zhuowen Tu • 2025
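
The abstract's point that a differentiable counter can steer generation by gradient can be illustrated with a toy sketch. This is not the YOLO-Count architecture: it assumes, purely for illustration, that the count is read out as the sum of a per-cell map with values in (0, 1), and it stands in a short list of logits for the generator's latent. The names `soft_count` and `guide_to_count` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_count(logits):
    """Toy differentiable count: sum of per-cell values in (0, 1).

    Assumption: the count is the sum of a per-cell map. This stands in
    for a learned counting model and is not taken from the source.
    """
    return sum(sigmoid(z) for z in logits)


def guide_to_count(logits, target, lr=0.5, steps=200):
    """Gradient descent on (soft_count - target)^2 with respect to the logits."""
    logits = list(logits)
    for _ in range(steps):
        values = [sigmoid(z) for z in logits]
        error = sum(values) - target
        # Chain rule: d/dz (count - target)^2 = 2 * error * s * (1 - s)
        logits = [
            z - lr * 2.0 * error * s * (1.0 - s)
            for z, s in zip(logits, values)
        ]
    return logits


start = [0.0] * 8  # each cell contributes 0.5, so the soft count starts at 4.0
guided = guide_to_count(start, target=6.0)
print(round(soft_count(start), 2), round(soft_count(guided), 2))
```

Because the count is a smooth function of the logits, the squared count error yields a usable gradient at every step. A hard, thresholded count would give no such signal.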

Related benchmarks

Task	Dataset	Result
Object Counting	FSC-147 (test)	MAE 14.8	379
Object Counting	FSC-147 (val)	MAE 14.8	279
Object Counting	FSC-147 (Average)	MAE 15.12	19
Object Counting	CLOC (test)	MAE 34.15	17
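
The MAE figures above are mean absolute errors between predicted and ground-truth object counts, averaged over images. A minimal sketch of the metric follows; the counts in the example are made-up illustrative values, not data from the benchmarks.

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over paired per-image counts."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical per-image counts, for illustration only.
predicted_counts = [12.0, 7.5, 30.0]
true_counts = [10, 8, 33]

mae = mean_absolute_error(predicted_counts, true_counts)
print(f"MAE = {mae:.3f}")
```

Lower is better: an MAE of 14.8 means the predicted count is off by about 14.8 objects per image on average.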
