Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering

About

Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains such as finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts such as 'Flu' and 'Cold' are treated as mere symbolic tokens, causing conceptually related samples to be isolated from one another. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC uses Large Language Models (LLMs) to distill the underlying data semantics into textual anchors via a semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. The CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms its counterparts.
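The abstract does not specify the loss functions, so the following is only a minimal sketch of what a jointly optimized contrastive and clustering objective could look like, not the authors' implementation. It assumes a symmetric InfoNCE loss between each row's tabular embedding and its textual-anchor embedding, plus a k-means-style term that pulls tabular embeddings toward their nearest centroid. All function names, the `weight` and `temperature` parameters, and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # L2-normalize so that dot products become cosine similarities.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(queries, keys, temperature=0.5):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    q = [normalize(x) for x in queries]
    k = [normalize(x) for x in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def clustering_loss(embeddings, centroids):
    """Mean squared distance from each embedding to its nearest centroid
    (an assumed k-means-style stand-in for the paper's clustering objective)."""
    total = 0.0
    for z in embeddings:
        total += min(
            sum((a - c) ** 2 for a, c in zip(z, mu)) for mu in centroids
        )
    return total / len(embeddings)


def joint_loss(tab, anchors, centroids, weight=1.0, temperature=0.5):
    # Symmetric contrastive term: table-to-text and text-to-table.
    contrastive = 0.5 * (
        info_nce(tab, anchors, temperature)
        + info_nce(anchors, tab, temperature)
    )
    return contrastive + weight * clustering_loss(tab, centroids)


# Toy data: three rows with 2-d tabular embeddings and matching text anchors.
tab = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
anchors_aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
anchors_shuffled = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]  # wrong pairing
centroids = [[1.0, 0.0], [0.0, 1.0]]

print(joint_loss(tab, anchors_aligned, centroids))
print(joint_loss(tab, anchors_shuffled, centroids))
```

In this toy setup the correctly paired anchors give a lower joint loss than the shuffled ones, which is the behavior a contrastive alignment term is meant to reward. In a real pipeline the embeddings would come from trainable encoders and the loss would be minimized by gradient descent.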

Mingjie Zhao, Yunfan Zhang, Yiqun Zhang, Yiu-ming Cheung • 2026

Related benchmarks

Task	Dataset	Result
Tabular Data Clustering	SO	ARI 0.4773	8
Tabular Data Clustering	CA	ARI 0.1862	8
Tabular Data Clustering	LE	ARI 0.4109	8
Tabular Data Clustering	MA	ARI 41.34	8
Tabular Data Clustering	ZO	ARI 0.8548	8
Tabular Data Clustering	HA	ARI 0.0863	8
Tabular Data Clustering	FE	ARI 0.0637	8
Tabular Data Clustering	HE	ARI 0.1563	8
Tabular Data Clustering	NU	ARI 28.22	8
Tabular Data Clustering	MU	ARI 0.5266	8

Showing 10 of 12 rows
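
The results are reported as ARI (Adjusted Rand Index), which measures agreement between predicted clusters and ground-truth labels, corrected for chance. It equals 1 for a perfect match, is close to 0 for random assignments, and can be negative. Because ARI cannot exceed 1, entries such as 41.34 are presumably on a percent scale, though that reading is an assumption. The following is a minimal pure-Python sketch of the standard ARI formula, given for illustration only. It is not the evaluation code used for the results above, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from pair counts of the contingency table."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treat as trivially identical

    # Pairs of samples that fall in the same cell, same true class,
    # and same predicted cluster, respectively.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# ARI ignores the label names themselves; only the grouping matters.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The first call compares two identical partitions with swapped label names, and the second compares two partitions that disagree on every pair.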
