MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching
About
Entity Matching (EM), which aims to identify all entity pairs referring to the same real-world entity across relational tables, is one of the most important tasks in real-world data management systems. Because labeling data for EM is extremely labor-intensive, unsupervised EM is more applicable than supervised EM in practical scenarios. Traditional unsupervised EM assumes that all entities come from two tables; in practical applications, however, it is more common to match entities across multiple tables, that is, multi-table entity matching (multi-table EM). Unfortunately, effective and efficient unsupervised multi-table EM remains under-explored. To fill this gap, this paper formally studies the problem of unsupervised multi-table entity matching and proposes an effective and efficient solution, termed MultiEM. MultiEM is a parallelizable pipeline of enhanced entity representation, table-wise hierarchical merging, and density-based pruning. Extensive experiments on six real-world benchmark datasets demonstrate the superiority of MultiEM in terms of both effectiveness and efficiency.
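The three stages named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the token-set representation, the Jaccard similarity, the thresholds, and the function names `normalize`, `merge_two`, `hierarchical_merge`, and `prune` are all assumptions chosen to keep the example self-contained. It only shows the general shape of merging tables two at a time in rounds and then discarding weakly connected cluster members.

```python
def normalize(record):
    """Toy stand-in for entity representation: a set of lowercase tokens."""
    return frozenset(record.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_two(left, right, threshold=0.5):
    """Merge two tables of clusters; a cluster is a list of (table_id, record).

    Each right-hand cluster joins the most similar left-hand cluster if the
    similarity of their first records reaches the threshold (an assumption).
    """
    merged = [list(c) for c in left]
    for cluster in right:
        rep = normalize(cluster[0][1])
        best, best_sim = None, threshold
        for target in merged[:len(left)]:  # only original left clusters are targets
            sim = jaccard(rep, normalize(target[0][1]))
            if sim >= best_sim:
                best, best_sim = target, sim
        if best is not None:
            best.extend(cluster)
        else:
            merged.append(list(cluster))
    return merged


def hierarchical_merge(tables, threshold=0.5):
    """Merge tables pairwise, round by round, until one table of clusters remains."""
    level = [[[(i, r)] for r in t] for i, t in enumerate(tables)]
    while len(level) > 1:
        nxt = [merge_two(level[i], level[i + 1], threshold)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd table out is carried to the next round
            nxt.append(level[-1])
        level = nxt
    return level[0] if level else []


def prune(clusters, min_sim=0.5):
    """Crude stand-in for density-based pruning: drop members whose mean
    similarity to the rest of their cluster is below min_sim, then keep
    only clusters that still have at least two members."""
    kept = []
    for cluster in clusters:
        reps = [normalize(r) for _, r in cluster]
        members = []
        for i, item in enumerate(cluster):
            others = [jaccard(reps[i], reps[j]) for j in range(len(cluster)) if j != i]
            if others and sum(others) / len(others) >= min_sim:
                members.append(item)
        if len(members) >= 2:
            kept.append(members)
    return kept


# Hypothetical example data: four small tables of product names.
tables = [
    ["apple iphone 12", "sony wh-1000xm4 headphones"],
    ["iphone 12 apple", "dell xps 13"],
    ["apple iphone 12 64gb", "sony headphones wh-1000xm4"],
    ["dell xps 13 laptop"],
]
clusters = prune(hierarchical_merge(tables))
for cluster in clusters:
    print(cluster)
```

In this sketch the merges within a single round do not depend on one another, so they could be dispatched to separate workers; that is one plausible reading of why such a pipeline is parallelizable.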
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Entity Matching | GEO | Precision | 90.5 | 9 |
| Entity Matching | Music 20K | Precision | 91.1 | 8 |
| Multi-table Entity Matching | Shopee | Precision | 34.5 | 7 |
| Entity Matching | Music-200K | Precision | 83.7 | 6 |
| Multi-table Entity Matching | Music-2M | Precision | 69.4 | 4 |
| Multi-table Entity Matching | Person | Precision | 33.6 | 4 |
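For readers unfamiliar with the metric, precision is the fraction of predicted matches that are correct. The sketch below computes it over unordered entity pairs. The page does not state the benchmarks' exact evaluation protocol (for example, whether multi-table results are scored on pairs or on whole tuples), so the pairwise formulation, the function name `pair_precision`, and the sample identifiers are assumptions for illustration only.

```python
def pair_precision(predicted, truth):
    """Fraction of distinct predicted pairs that are true matches.

    Pairs are treated as unordered, so ("a1", "b1") equals ("b1", "a1").
    Returns 0.0 when nothing is predicted (a convention, not a standard).
    """
    pred = {frozenset(p) for p in predicted}
    gold = {frozenset(p) for p in truth}
    if not pred:
        return 0.0
    return len(pred & gold) / len(pred)


# Hypothetical identifiers; the fourth prediction duplicates the first.
predicted = [("a1", "b1"), ("a2", "b2"), ("a3", "b4"), ("b1", "a1")]
truth = [("a1", "b1"), ("b2", "a2"), ("a3", "b3")]

score = pair_precision(predicted, truth)
print(f"precision = {score:.3f}")
```

Here two of the three distinct predicted pairs appear in the ground truth, so the score is 2/3.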