
Multi-level Cross-modal Alignment for Image Clustering

About

Recently, cross-modal pretraining models have been employed to produce meaningful pseudo-labels that supervise the training of image clustering models. However, the numerous erroneous alignments in a cross-modal pretraining model can produce poor-quality pseudo-labels and degrade clustering performance. To address this issue, we propose a novel Multi-level Cross-modal Alignment method that improves the alignments in a cross-modal pretraining model for downstream tasks by building a smaller but better semantic space and aligning images and texts at three levels: instance-level, prototype-level, and semantic-level. Theoretical analysis shows that our proposed method converges and suggests effective means to reduce its expected clustering risk. Experimental results on five benchmark datasets clearly show the superiority of our new method.
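The instance-level alignment the abstract mentions is commonly realized as a CLIP-style symmetric contrastive loss that pulls each image embedding toward its paired text embedding. The sketch below illustrates that idea only; the function names, the temperature value, and the NumPy formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def instance_level_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss: row i of img_emb is
    treated as the positive pair of row i of txt_emb; all other rows in
    the batch act as negatives."""
    img = l2_normalize(np.asarray(img_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(txt_emb, dtype=np.float64))
    logits = img @ txt.T / temperature          # pairwise cosine similarities
    n = logits.shape[0]

    def cross_entropy_diag(lg):
        # Numerically stable log-softmax; targets lie on the diagonal.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Correctly paired image/text embeddings should yield a much lower loss than mismatched pairs, which is what drives the embeddings of the two modalities toward alignment during training.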

Liping Qiu, Qin Zhang, Xiaojun Chen, Shaotian Cai • 2024

Related benchmarks

Task              Dataset               Metric    Result  Rank
Image Clustering  CIFAR-10              NMI       0.849   318
Image Clustering  STL-10                ACC       98.1    282
Clustering        CIFAR-10 (test)       Accuracy  92.7    190
Clustering        STL-10 (test)         Accuracy  98.1    152
Clustering        ImageNet-Dogs         NMI       73.3    85
Clustering        ImageNet-Dogs (test)  NMI       0.733   40
