Our new X account is live! Follow @wizwand_team for updates
WorkDL logo mark

An algorithm for clustering with confidence-based must-link and cannot-link constraints

About

We study here the semi-supervised $k$-clustering problem where information is available on whether pairs of objects are in the same or in different clusters. This information is either available with certainty or with a limited level of confidence. We introduce the PCCC (Pairwise-Confidence-Constraints-Clustering) algorithm, which iteratively assigns objects to clusters while accounting for the information provided on the pairs of objects. Our algorithm uses integer programming for the assignment of objects which allows to include relationships as hard constraints that are guaranteed to be satisfied or as soft constraints that can be violated subject to a penalty. This flexibility distinguishes our algorithm from the state-of-the-art in which all pairwise constraints are either considered hard, or all are considered soft. We developed an enhanced multi-start approach and a model-size reduction technique for the integer program that contributes to the effectiveness and the efficiency of the algorithm. Unlike existing algorithms, our algorithm scales to large-scale instances with up to 60,000 objects, 100 clusters, and millions of cannot-link constraints (which are the most challenging constraints to incorporate). We compare the PCCC algorithm with state-of-the-art approaches in an extensive computational study. Even though the PCCC algorithm is more general than the state-of-the-art approaches in its applicability, it outperforms the state-of-the-art approaches on instances with all hard or all soft constraints both in terms of runtime and various metrics of solution quality. The code of the PCCC algorithm is publicly available on GitHub.

Philipp Baumann, Dorit S. Hochbaum• 2022

Related benchmarks

TaskDatasetResultRank
Pairwise-constrained clusteringIris UCI (full)
SSE83.72
15
Pairwise-constrained clusteringSeeds UCI (full)
SSE600.4
15
Pairwise-constrained clusteringHemi
SSE1.36e+7
14
Pairwise-constrained clusteringRDS CNT
SSE2.93e+7
13
Pairwise-constrained clusteringPR2392
SSE2.77e+10
13
Pairwise-constrained clusteringHTRU2
SSE1.23e+7
12
Pairwise-constrained clusteringAC
SSE2.02e+3
12
Pairwise-constrained clusteringSkin
SSE9.30e+8
12
Showing 8 of 8 rows

Other info

Follow for update