Persistent Multiscale Density-based Clustering
About
Clustering is a cornerstone of modern data analysis. Detecting clusters in exploratory data analyses (EDA) requires algorithms that make few assumptions about the data. Density-based clustering algorithms are particularly well-suited for EDA because they describe high-density regions, assuming only that a density exists. Applying density-based clustering algorithms in practice, however, requires selecting appropriate hyperparameters, which is difficult without prior knowledge of the data distribution. For example, DBSCAN requires selecting a density threshold, and HDBSCAN* relies on a minimum cluster size parameter. In this work, we propose Persistent Leaves Spatial Clustering for Applications with Noise (PLSCAN). This novel density-based clustering algorithm efficiently identifies all minimum cluster sizes for which HDBSCAN* produces stable (leaf) clusters. PLSCAN applies scale-space clustering principles and is equivalent to persistent homology on a novel metric space. We compare its performance to HDBSCAN* on several real-world datasets, demonstrating that it achieves a higher average ARI and is less sensitive to changes in the number of mutual reachability neighbours. Additionally, we compare PLSCAN's computational costs to k-Means, demonstrating competitive run-times on low-dimensional datasets. At higher dimensions, run times scale more similarly to HDBSCAN*.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Image Clustering | CIFAR-10 | -- | 243 | |
| Clustering | Fashion MNIST | -- | 95 | |
| Clustering | Iris | ARI0.9 | 29 | |
| Clustering | MNIST | ARI0.96 | 19 | |
| Clustering | E.coli | ARI0.63 | 18 | |
| Clustering | YE | ARI0.75 | 16 | |
| Clustering | Articles-1442-5 | ARI99 | 7 | |
| Clustering | Articles-1442-80 | ARI0.99 | 7 | |
| Clustering | Audioset music | ARI0.27 | 7 | |
| Clustering | CTG | ARI1 | 7 |