Language-Assisted Image Clustering Guided by Discriminative Relational Signals and Adaptive Semantic Centers
About
Language-Assisted Image Clustering (LAIC) augments input images with additional texts generated with the help of vision-language models (VLMs) to improve clustering performance. Despite recent progress, existing LAIC methods often overlook two issues: (i) the textual features constructed for each image are highly similar to one another, leading to weak inter-class discriminability; (ii) the clustering step is restricted to pre-built image-text alignments, limiting the potential for better utilization of the text modality. To address these issues, we propose a new LAIC framework with two complementary components. First, we exploit cross-modal relations to produce more discriminative self-supervision signals for clustering, since relational supervision is compatible with the training mechanisms of most VLMs. Second, we learn category-wise continuous semantic centers via prompt learning to produce the final clustering assignments. Extensive experiments on eight benchmark datasets demonstrate that our method achieves an average improvement of 2.6% over state-of-the-art methods, and the learned semantic centers exhibit strong interpretability. Code is available in the supplementary material.
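The two components above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: all names, shapes, and the temperature value are illustrative assumptions, the embeddings are random stand-ins for pre-extracted VLM (e.g. CLIP-style) features, and the semantic centers are randomly initialized rather than learned via prompt tuning.

```python
import numpy as np

# Hypothetical sketch; all names and values are illustrative assumptions.
rng = np.random.default_rng(0)
n_images, n_texts, dim, n_clusters = 8, 5, 16, 3

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Random stand-ins for pre-extracted image and text embeddings.
image_feats = l2_normalize(rng.normal(size=(n_images, dim)))
text_feats = l2_normalize(rng.normal(size=(n_texts, dim)))

# (i) Cross-modal relations: describe each image by its similarity
# profile over the text set, which can be more discriminative than the
# near-duplicate per-image texts themselves.
relations = image_feats @ text_feats.T          # (n_images, n_texts)

# (ii) Continuous semantic centers: one embedding per category (randomly
# initialized here; learned via prompt tuning in the paper). The final
# assignment is a temperature-scaled softmax over image-center similarity.
centers = l2_normalize(rng.normal(size=(n_clusters, dim)))
assignments = softmax(image_feats @ centers.T / 0.07)   # rows sum to 1
labels = assignments.argmax(axis=1)              # hard cluster labels
```

In practice the relation matrix would supply self-supervision targets for training, and the centers would be optimized jointly with the prompts rather than sampled at random.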
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Clustering | CIFAR-10 | NMI | 0.852 | 318 |
| Image Clustering | STL-10 | ACC | 98.5 | 282 |
| Image Clustering | ImageNet-10 | NMI | 0.996 | 201 |
| Clustering | CIFAR-10 (test) | Accuracy | 92.9 | 190 |
| Clustering | STL-10 (test) | Accuracy | 98.5 | 152 |
| Clustering | CIFAR-100 (test) | ACC | 58.1 | 123 |
| Clustering | ImageNet-Dogs | NMI | 86.2 | 85 |
| Clustering | ImageNet-10 (test) | ACC | 99.8 | 74 |
| Clustering | ImageNet-Dogs (test) | NMI | 0.862 | 40 |
| Image Clustering | DTD (test) | NMI | 63.8 | 13 |