Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering

About

Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like 'Flu' and 'Cold' are often treated as unrelated symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. This CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms its counterparts.
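The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss functions, so everything here is an assumption made for illustration: a one-directional InfoNCE loss stands in for the contrastive term, a k-means-style nearest-centroid distance stands in for the clustering term, and the names `info_nce`, `clustering_loss`, `joint_loss`, `temperature`, and `weight` are hypothetical. The embeddings are plain lists of floats; in a real pipeline they would come from a tabular encoder and from a text encoder applied to the LLM-generated anchors.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(tab_emb, anchor_emb, temperature=0.5):
    """Assumed contrastive term: row i of tab_emb should match row i of anchor_emb."""
    tab = [_normalize(v) for v in tab_emb]
    anchors = [_normalize(v) for v in anchor_emb]
    loss = 0.0
    for i, t in enumerate(tab):
        # Cosine similarity of sample i to every textual anchor in the batch.
        logits = [_dot(t, a) / temperature for a in anchors]
        # Numerically stable log-sum-exp over all anchors.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(tab)


def clustering_loss(tab_emb, centroids):
    """Assumed clustering term: mean squared distance to the nearest centroid."""
    total = 0.0
    for v in tab_emb:
        total += min(
            sum((x - c) ** 2 for x, c in zip(v, centroid)) for centroid in centroids
        )
    return total / len(tab_emb)


def joint_loss(tab_emb, anchor_emb, centroids, weight=1.0):
    """Hypothetical joint objective: contrastive term plus weighted clustering term."""
    return info_nce(tab_emb, anchor_emb) + weight * clustering_loss(tab_emb, centroids)


if __name__ == "__main__":
    tab = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(tab, aligned) < info_nce(tab, swapped))
```

In this sketch the contrastive term is small when each tabular embedding points in the same direction as its own anchor and large when anchors are mismatched, which is the behavior the abstract attributes to anchoring representations to textual concepts. The actual TagCC losses may differ.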

Mingjie Zhao, Yunfan Zhang, Yiqun Zhang, Yiu-ming Cheung • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Tabular Data Clustering | SO | ARI 0.4773 | 8 |
| Tabular Data Clustering | CA | ARI 0.1862 | 8 |
| Tabular Data Clustering | LE | ARI 0.4109 | 8 |
| Tabular Data Clustering | MA | ARI 41.34 | 8 |
| Tabular Data Clustering | ZO | ARI 0.8548 | 8 |
| Tabular Data Clustering | HA | ARI 0.0863 | 8 |
| Tabular Data Clustering | FE | ARI 0.0637 | 8 |
| Tabular Data Clustering | HE | ARI 0.1563 | 8 |
| Tabular Data Clustering | NU | ARI 28.22 | 8 |
| Tabular Data Clustering | MU | ARI 0.5266 | 8 |
Showing 10 of 12 rows
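
The results above are reported as ARI (Adjusted Rand Index), which measures agreement between predicted cluster assignments and ground-truth labels, corrected for chance. As a reading aid, here is a minimal pure-Python sketch of the standard ARI formula; the function name `adjusted_rand_index` is chosen for illustration, and nothing here is taken from the paper's evaluation code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two samples: the labelings agree trivially.
        return 1.0

    # Contingency counts: samples sharing each (true, predicted) label pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2         # best achievable agreement
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Same partition under different label names.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

ARI is invariant to how clusters are named, equals 1 for identical partitions, and is close to 0 for random assignments. Because it is bounded above by 1, entries such as 41.34 and 28.22 are presumably on a percentage scale; that is an inference from the metric's range, not something the listing states.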