Rethinking Divisive Hierarchical Clustering from a Distributional Perspective
About
We find that current objective-based Divisive Hierarchical Clustering (DHC) methods produce dendrograms that lack three desired properties: no unwarranted splitting, grouping of similar clusters into the same subset, and ground-truth correspondence. This shortcoming has its root cause in the use of a set-oriented bisecting assessment criterion. We show that it can be addressed by using a distributional kernel instead of the set-oriented criterion, and that the resultant clusters achieve a new distribution-oriented objective: maximizing the total similarity of all clusters (TSC). Our theoretical analysis shows that the resultant dendrogram guarantees a lower bound on TSC. The empirical evaluation shows the effectiveness of our proposed method on artificial and Spatial Transcriptomics (bioinformatics) datasets. Our proposed method successfully creates a dendrogram that is consistent with the biological regions in a Spatial Transcriptomics dataset, whereas other contenders fail.
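To make the idea of comparing clusters as distributions rather than as point sets more concrete, here is a minimal sketch. It assumes a Gaussian kernel mean embedding as the distributional kernel; the abstract does not say which kernel the authors use, so the kernel choice, the `gamma` parameter, and the function names are all illustrative assumptions rather than the paper's method.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Point-to-point Gaussian kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def distributional_similarity(set_p, set_q, gamma=1.0):
    """Similarity between two clusters treated as empirical distributions.

    Assumption: this is the average pairwise kernel value, i.e. the inner
    product of the two kernel mean embeddings. It is a stand-in for the
    distributional kernel mentioned in the abstract, not the authors' kernel.
    """
    total = sum(gaussian_kernel(x, y, gamma) for x in set_p for y in set_q)
    return total / (len(set_p) * len(set_q))


# Toy data (hypothetical): two nearby clusters and one distant cluster.
cluster_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
cluster_b = [(0.05, 0.05), (0.1, 0.1)]
cluster_c = [(5.0, 5.0), (5.1, 5.0)]

sim_ab = distributional_similarity(cluster_a, cluster_b)
sim_ac = distributional_similarity(cluster_a, cluster_c)
```

In this toy setup `sim_ab` is larger than `sim_ac`, reflecting that clusters drawn from nearby regions look more alike as distributions than clusters that are far apart.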
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Hierarchical Agglomerative Clustering | Wine | Dendrogram Purity: 0.95 | 26 |
| Hierarchical Clustering | LSVT | Dendrogram Purity: 74 | 6 |
| Hierarchical Clustering | musk | Dendrogram Purity: 57 | 6 |
| Hierarchical Clustering | Spam | Dendrogram Purity: 84 | 6 |
| Hierarchical Clustering | STL-10 | Dendrogram Purity: 0.63 | 6 |
| Hierarchical Clustering | ALLAML | Dendrogram Purity: 73 | 6 |
| Hierarchical Clustering | SEEDS | Dendrogram Purity: 87 | 6 |
| Hierarchical Clustering | WDBC | Dendrogram Purity: 90 | 6 |
| Hierarchical Clustering | LandCover | Dendrogram Purity: 55 | 6 |
| Hierarchical Clustering | banknote | Dendrogram Purity: 97 | 6 |
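
For readers unfamiliar with the metric in the table, the following is a minimal sketch of Dendrogram Purity under its commonly used definition: for every pair of leaves that share a ground-truth class, take their lowest common ancestor and measure the fraction of leaves under it that belong to that class, then average over all such pairs. The nested-tuple tree encoding and the function name are assumptions made for illustration, and the function returns a fraction in [0, 1].

```python
def dendrogram_purity(tree, labels):
    """Dendrogram purity of a binary tree over labelled leaves.

    tree   : nested 2-tuples whose leaves are integer indices into `labels`
             (hypothetical encoding chosen for this sketch).
    labels : ground-truth class label for each leaf index.
    """
    total = 0.0
    count = 0

    def visit(node):
        nonlocal total, count
        if isinstance(node, int):
            return [node]
        left_leaves = visit(node[0])
        right_leaves = visit(node[1])
        all_leaves = left_leaves + right_leaves
        # A pair with one leaf in each child has this node as its
        # lowest common ancestor.
        for i in left_leaves:
            for j in right_leaves:
                if labels[i] == labels[j]:
                    same = sum(1 for k in all_leaves if labels[k] == labels[i])
                    total += same / len(all_leaves)
                    count += 1
        return all_leaves

    visit(tree)
    return total / count if count else 0.0


labels = ["a", "a", "b", "b"]
good_tree = ((0, 1), (2, 3))   # each class forms its own subtree
mixed_tree = ((0, 2), (1, 3))  # classes are only joined at the root
purity_good = dendrogram_purity(good_tree, labels)
purity_mixed = dendrogram_purity(mixed_tree, labels)
```

This version is quadratic in the number of leaves at each internal node, which is fine for small examples; larger datasets would typically need pair sampling or a more careful counting scheme.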