Ultra-Scalable Spectral Clustering and Ensemble Clustering
About
This paper focuses on scalability and robustness of spectral clustering for extremely large-scale datasets with limited resources. Two novel algorithms are proposed, namely, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a hybrid representative selection strategy and a fast approximation method for K-nearest representatives are proposed for the construction of a sparse affinity sub-matrix. By interpreting the sparse sub-matrix as a bipartite graph, the transfer cut is then utilized to efficiently partition the graph and obtain the clustering result. In U-SENC, multiple U-SPEC clusterers are further integrated into an ensemble clustering framework to enhance the robustness of U-SPEC while maintaining high efficiency. Based on the ensemble generation via multiple U-SEPC's, a new bipartite graph is constructed between objects and base clusters and then efficiently partitioned to achieve the consensus clustering result. It is noteworthy that both U-SPEC and U-SENC have nearly linear time and space complexity, and are capable of robustly and efficiently partitioning ten-million-level nonlinearly-separable datasets on a PC with 64GB memory. Experiments on various large-scale datasets have demonstrated the scalability and robustness of our algorithms. The MATLAB code and experimental data are available at https://www.researchgate.net/publication/330760669.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Clustering | MNIST (test) | -- | 122 | |
| Clustering | MNIST | NMI0.7502 | 92 | |
| Clustering | USPS | NMI73.89 | 82 | |
| Clustering | USPS | Accuracy0.7817 | 36 | |
| Spectral Clustering | Letters | Time cost (s)1.44 | 29 | |
| Spectral Clustering | pendigits | Time Cost (s)1.01 | 29 | |
| Spectral Clustering | USPS | Time (s)1.59 | 29 | |
| Spectral Clustering | MNIST | Time Cost (s)7.48 | 28 | |
| Clustering | pendigits | NMI (%)85.34 | 26 | |
| Spectral Clustering | Covertype | Time Cost (s)14.08 | 22 |