Automatic topography of high-dimensional data sets by non-parametric Density Peak clustering

About

Data analysis in high-dimensional spaces aims at obtaining a synthetic description of a data set, revealing its main structure and its salient features. We here introduce an approach providing this description in the form of a topography of the data, namely a human-readable chart of the probability density from which the data are harvested. The approach is based on an unsupervised extension of Density Peak clustering and a non-parametric density estimator that measures the probability density in the manifold containing the data. This allows finding automatically the number and the height of the peaks of the probability density, and the depth of the "valleys" separating them. Importantly, the density estimator provides a measure of the error, which allows distinguishing genuine density peaks from density fluctuations due to finite sampling. The approach thus provides robust and visual information about the density peaks' height, their statistical reliability, and their hierarchical organization, offering a conceptually powerful extension of the standard clustering partitions. We show that this framework is particularly useful in the analysis of complex data sets.
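As a rough illustration of the basic Density Peak idea the abstract builds on, the sketch below scores each point by a density value and by its distance to the nearest denser point, then treats points that are far from any denser point as peaks. It is not the authors' method. It assumes a crude k-nearest-neighbour density proxy in place of the non-parametric estimator, has no error estimate, and selects peaks with a hand-set threshold, which is the manual step the paper aims to automate. The function name `density_peak_labels` and the parameters `k` and `delta_min` are hypothetical.

```python
import math
import random


def density_peak_labels(points, k=5, delta_min=2.0):
    """Toy Density Peak clustering; returns (labels, centers)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Assumed density proxy: inverse distance to the k-th nearest neighbour
    # (index 0 of each sorted row is the point itself).
    rho = [1.0 / (sorted(row)[k] + 1e-12) for row in dist]

    # Visit points from densest to sparsest.
    order = sorted(range(n), key=lambda i: -rho[i])

    delta = [0.0] * n   # distance to the nearest denser point
    parent = [-1] * n   # index of that denser point
    for rank, i in enumerate(order):
        if rank == 0:
            # The global maximum has no denser point, so it is always a peak.
            delta[i] = float("inf")
            continue
        j = min(order[:rank], key=lambda m: dist[i][m])
        delta[i], parent[i] = dist[i][j], j

    # Peaks: points far away from every denser point (hand-set threshold).
    centers = [i for i in order if delta[i] > delta_min]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c

    # Every other point inherits the label of its nearest denser neighbour,
    # which has already been labelled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Two well-separated synthetic blobs (made-up data, for illustration only).
random.seed(0)
blob_a = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(40)]
blob_b = [(random.gauss(5, 0.3), random.gauss(5, 0.3)) for _ in range(40)]
labels, centers = density_peak_labels(blob_a + blob_b)
```

On this toy input the threshold separates the two blobs cleanly. On real data the choice of `delta_min` is delicate, which is why an estimator with an error measure is useful for telling genuine peaks from sampling fluctuations.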

Maria d'Errico, Elena Facco, Alessandro Laio, Alex Rodriguez • 2018

Related benchmarks

Task	Dataset	Result
Clustering	Wine	ARI 0.05	53
Clustering	pendigits	ARI 79	49
Clustering	Dermatology	AMI 0.73	26
Clustering	Cancer	ARI 0.00e+0	25
Clustering	SEMEION	ARI 26	19
Clustering	MULTI-FEAT	AMI 83	18
Clustering	Letters	AMI 57	16
Clustering	USPS	AMI 74	9
Clustering	Soybean	AMI 54	9
Clustering	Hepatitis	F1 Score 0.42	6
