Joint Representation Learning and Novel Category Discovery on Single- and Multi-modal Data

About

This paper studies the problem of novel category discovery on single- and multi-modal data with labels from different but relevant categories. We present a generic, end-to-end framework to jointly learn a reliable representation and assign clusters to unlabelled data. To avoid over-fitting the learnt embedding to labelled data, we take inspiration from self-supervised representation learning by noise-contrastive estimation and extend it to jointly handle labelled and unlabelled data. In particular, we propose using category discrimination on labelled data and cross-modal discrimination on multi-modal data to augment the instance discrimination used in conventional contrastive learning approaches. We further employ the Winner-Take-All (WTA) hashing algorithm on the shared representation space to generate pairwise pseudo labels for unlabelled data to better predict cluster assignments. We thoroughly evaluate our framework on the large-scale multi-modal video benchmarks Kinetics-400 and VGG-Sound, and on the image benchmarks CIFAR10, CIFAR100 and ImageNet, obtaining state-of-the-art results.
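As an illustration of the WTA step mentioned in the abstract, the following is a minimal sketch of how Winner-Take-All hashing can turn two embeddings into a pairwise pseudo label. It is not the authors' implementation: the function names, the number of hashes, the window size and the matching threshold are all assumptions chosen for the example, and the abstract does not state the settings used in the paper. The key property shown is that a WTA code depends only on the rank order of the embedding's entries, not on their magnitudes.

```python
import random
from typing import List, Sequence


def make_permutations(dim: int, num_hashes: int, window: int, seed: int = 0) -> List[List[int]]:
    """Draw `num_hashes` random index windows of size `window`.

    Sampling `window` distinct indices is equivalent to keeping the first
    `window` entries of a random permutation of range(dim).
    (All parameter values used below are illustrative assumptions.)
    """
    rng = random.Random(seed)
    return [rng.sample(range(dim), window) for _ in range(num_hashes)]


def wta_hash(vec: Sequence[float], perms: List[List[int]]) -> List[int]:
    """Return the WTA code: for each window, the position of its largest value."""
    code = []
    for idx in perms:
        window_vals = [vec[i] for i in idx]
        code.append(max(range(len(idx)), key=window_vals.__getitem__))
    return code


def pairwise_pseudo_label(code_a: List[int], code_b: List[int], threshold: int) -> int:
    """Label a pair 1 (same cluster) if at least `threshold` hash positions agree."""
    matches = sum(a == b for a, b in zip(code_a, code_b))
    return 1 if matches >= threshold else 0


if __name__ == "__main__":
    perms = make_permutations(dim=8, num_hashes=16, window=4)
    x = [0.9, 0.1, 0.4, 0.3, 0.8, 0.2, 0.7, 0.05]
    x_scaled = [2 * v + 1 for v in x]   # same rank order as x
    x_flipped = [-v for v in x]         # reversed rank order

    cx, cs, cf = (wta_hash(v, perms) for v in (x, x_scaled, x_flipped))
    print(pairwise_pseudo_label(cx, cs, threshold=12))  # rank order preserved
    print(pairwise_pseudo_label(cx, cf, threshold=12))  # rank order reversed
```

Because the comparison uses rank order rather than raw distances, a pair whose embeddings differ only by a monotone rescaling receives identical codes, while a pair with reversed ordering shares none. In a training loop, such labels would be computed on the shared representation of unlabelled samples; how they enter the loss is not specified in the abstract.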

Xuhui Jia, Kai Han, Yukun Zhu, Bradley Green • 2021

Related benchmarks

Task	Dataset	Result
Generalized Category Discovery	CIFAR-100	Accuracy (All) 44.1	268
Generalized Category Discovery	ImageNet-100	All Accuracy 33.1	252
Category Discovery	CUB-200 2011	Overall Score 26.5	87
Generalized Category Discovery	CUB-200 (test)	Overall Accuracy 26.5	81
Category Discovery	Stanford Cars	Accuracy (All) 20	71
Category Discovery	CIFAR10	Accuracy (All) 65.4	60
Generalized Category Discovery	Oxford Pets	Accuracy (All) 35.2	50
Category Discovery	Food101	Accuracy (All) 18.2	45
Category Discovery	CIFAR-100	Accuracy (All Categories) 44.1	39
Fine-grained object category discovery	Stanford Cars (test)	--	38

Showing 10 of 23 rows
