Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering
About
Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like 'Flu' and 'Cold' are often treated as symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. This CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms its counterparts.
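The abstract does not give the loss functions, so the following is only a minimal sketch of what "contrastive alignment jointly optimized with a clustering objective" could look like. It assumes an InfoNCE-style loss in which each row embedding is pulled toward the embedding of its own textual anchor, and a simple weighted sum for the joint objective. The names `info_nce`, `joint_loss`, `temperature`, and `weight` are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(row_embs, anchor_embs, temperature=0.1):
    """Mean InfoNCE loss where the positive for row i is anchor i.

    Assumption: one textual anchor per tabular row; all other anchors in
    the batch act as negatives.
    """
    rows = [l2_normalize(v) for v in row_embs]
    anchors = [l2_normalize(v) for v in anchor_embs]
    total = 0.0
    for i, row in enumerate(rows):
        # Cosine similarity to every anchor, scaled by the temperature.
        logits = [
            sum(a * b for a, b in zip(row, anchor)) / temperature
            for anchor in anchors
        ]
        # Numerically stable log-sum-exp over all anchors.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(rows)


def joint_loss(contrastive, clustering, weight=1.0):
    """Hypothetical joint objective: contrastive term plus weighted clustering term."""
    return contrastive + weight * clustering


# Toy check: rows aligned with their anchors should score a lower loss
# than rows paired with the wrong anchors.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In a real training loop the two embedding sets would come from a tabular encoder and a text encoder, and the clustering term would be whatever objective the paper defines; both are left abstract here.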
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Tabular Data Clustering | SO | ARI | 0.4773 | 8 |
| Tabular Data Clustering | CA | ARI | 0.1862 | 8 |
| Tabular Data Clustering | LE | ARI | 0.4109 | 8 |
| Tabular Data Clustering | MA | ARI | 41.34 | 8 |
| Tabular Data Clustering | ZO | ARI | 0.8548 | 8 |
| Tabular Data Clustering | HA | ARI | 0.0863 | 8 |
| Tabular Data Clustering | FE | ARI | 0.0637 | 8 |
| Tabular Data Clustering | HE | ARI | 0.1563 | 8 |
| Tabular Data Clustering | NU | ARI | 28.22 | 8 |
| Tabular Data Clustering | MU | ARI | 0.5266 | 8 |
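
The table reports the Adjusted Rand Index (ARI), which measures agreement between predicted clusters and ground-truth labels, corrected for chance. As a reading aid, here is a small standard-library sketch of the usual pair-counting formula. The function name `adjusted_rand_index` is illustrative, and the code assumes at least two samples. Because ARI cannot exceed 1, entries such as 41.34 and 28.22 are presumably reported on a percentage scale.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI; assumes both label lists have the same length >= 2."""
    n = len(labels_true)
    # Pairs of samples that share a cluster in both partitions.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(count, 2) for count in contingency.values())
    # Pairs that share a cluster within each partition separately.
    sum_true = sum(comb(count, 2) for count in Counter(labels_true).values())
    sum_pred = sum(comb(count, 2) for count in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Identical partitions up to label renaming score 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Cluster identifiers do not need to match between the two lists, since only co-membership of sample pairs is counted.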