Language-Assisted Image Clustering Guided by Discriminative Relational Signals and Adaptive Semantic Centers
About
Language-Assisted Image Clustering (LAIC) augments input images with additional texts generated with the help of vision-language models (VLMs) to improve clustering performance. Despite recent progress, existing LAIC methods often overlook two issues: (i) the textual features constructed for each image are highly similar to one another, leading to weak inter-class discriminability; (ii) the clustering step is restricted to pre-built image-text alignments, limiting the potential for better utilization of the text modality. To address these issues, we propose a new LAIC framework with two complementary components. First, we exploit cross-modal relations to produce more discriminative self-supervision signals for clustering, since relational supervision is compatible with the training mechanisms of most VLMs. Second, we learn category-wise continuous semantic centers via prompt learning to produce the final clustering assignments. Extensive experiments on eight benchmark datasets demonstrate that our method achieves an average improvement of 2.6% over state-of-the-art methods, and the learned semantic centers exhibit strong interpretability. Code is available in the supplementary material.
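The two components above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: all names, shapes, and the temperature value are illustrative assumptions, the embeddings are random stand-ins for pre-extracted VLM (e.g. CLIP-style) features, and the semantic centers are randomly initialized rather than learned via prompt tuning.

```python
import numpy as np

# Hypothetical sketch; all names and values are illustrative assumptions.
rng = np.random.default_rng(0)
n_images, n_texts, dim, n_clusters = 8, 5, 16, 3

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Random stand-ins for pre-extracted image and text embeddings.
image_feats = l2_normalize(rng.normal(size=(n_images, dim)))
text_feats = l2_normalize(rng.normal(size=(n_texts, dim)))

# (i) Cross-modal relations: describe each image by its similarity
# profile over the text set, which can be more discriminative than the
# near-duplicate per-image texts themselves.
relations = image_feats @ text_feats.T          # (n_images, n_texts)

# (ii) Continuous semantic centers: one embedding per category (randomly
# initialized here; learned via prompt tuning in the paper). The final
# assignment is a temperature-scaled softmax over image-center similarity.
centers = l2_normalize(rng.normal(size=(n_clusters, dim)))
assignments = softmax(image_feats @ centers.T / 0.07)   # rows sum to 1
labels = assignments.argmax(axis=1)              # hard cluster labels
```

In practice the relation matrix would supply self-supervision targets for training, and the centers would be optimized jointly with the prompts rather than sampled at random.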
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Clustering | CIFAR-10 | NMI | 0.852 | 318 |
| Image Clustering | STL-10 | ACC | 98.5 | 282 |
| Image Clustering | ImageNet-10 | NMI | 0.996 | 201 |
| Clustering | CIFAR-10 (test) | Accuracy | 92.9 | 190 |
| Clustering | STL-10 (test) | Accuracy | 98.5 | 152 |
| Clustering | CIFAR-100 (test) | ACC | 58.1 | 123 |
| Clustering | ImageNet-Dogs | NMI | 86.2 | 85 |
| Clustering | ImageNet-10 (test) | ACC | 99.8 | 74 |
| Clustering | ImageNet-Dogs (test) | NMI | 0.862 | 40 |
| Image Clustering | DTD (test) | NMI | 63.8 | 13 |