Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees
About
We introduce Lumbermark, a robust divisive clustering algorithm capable of detecting clusters of varying sizes, densities, and shapes. Lumbermark iteratively chops off large limbs connected by protruding segments of a dataset's mutual reachability minimum spanning tree. The use of mutual reachability distances smoothens the data distribution and decreases the influence of low-density objects, such as noise points between clusters or outliers at their peripheries. The algorithm can be viewed as an alternative to HDBSCAN that produces partitions with user-specified sizes. A fast, easy-to-use implementation of the new method is available in the open-source 'lumbermark' package for Python and R. We show that Lumbermark performs well on benchmark data and hope it will prove useful to data scientists and practitioners across different fields.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Clustering | Random datasets Uniform distribution on the unit interval in R^d (test) | Runtime (s)0.01 | 117 | |
| Clustering | Benchmark dataset repository All labels v1.1.0 | Cases with AR < 0.8 Count11 | 14 | |
| Clustering | Benchmark dataset repository Third-party labels only v1.1.0 | Cases with AR < 0.86 | 14 |