Our new X account is live! Follow @wizwand_team for updates
WorkDL logo mark

From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering

About

Similarity-based Hierarchical Clustering (HC) is a classical unsupervised machine learning algorithm that has traditionally been solved with heuristic algorithms like Average-Linkage. Recently, Dasgupta reframed HC as a discrete optimization problem by introducing a global cost function measuring the quality of a given tree. In this work, we provide the first continuous relaxation of Dasgupta's discrete optimization problem with provable quality guarantees. The key idea of our method, HypHC, is showing a direct correspondence from discrete trees to continuous representations (via the hyperbolic embeddings of their leaf nodes) and back (via a decoding algorithm that maps leaf embeddings to a dendrogram), allowing us to search the space of discrete binary trees with continuous optimization. Building on analogies between trees and hyperbolic space, we derive a continuous analogue for the notion of lowest common ancestor, which leads to a continuous relaxation of Dasgupta's discrete objective. We can show that after decoding, the global minimizer of our continuous relaxation yields a discrete tree with a (1 + epsilon)-factor approximation for Dasgupta's optimal tree, where epsilon can be made arbitrarily small and controls optimization challenges. We experimentally evaluate HypHC on a variety of HC benchmarks and find that even approximate solutions found with gradient descent have superior clustering quality than agglomerative heuristics or other gradient based algorithms. Finally, we highlight the flexibility of HypHC using end-to-end training in a downstream classification task.

Ines Chami, Albert Gu, Vaggos Chatziafratis, Christopher R\'e• 2020

Related benchmarks

TaskDatasetResultRank
Hierarchical Agglomerative ClusteringWine
Dendrogram Purity0.887
26
Hierarchical Agglomerative ClusteringIris
Dendrogram Purity0.76
20
Hierarchical Agglomerative ClusteringDigits
Dendrogram Purity0.335
20
Hierarchical ClusteringBr. Cancer
DP Score96.5
10
Hierarchical ClusteringSpambase
DP75.4
10
Hierarchical Clusteringzoo
DP0.968
10
ClassificationIris (test)
Accuracy85.6
10
Hierarchical ClusteringSpambase
Dasgupta's Cost7.92e+7
10
Hierarchical ClusteringOpticalDigits
Dasgupta Cost1.27e+6
10
Hierarchical Clusteringpendigits
DP11.7
9
Showing 10 of 13 rows

Other info

Follow for update