Sherlock: A Deep Learning Approach to Semantic Data Type Detection

About

Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on $686,765$ data columns retrieved from the VizNet corpus by matching $78$ semantic types from DBpedia to column headers. We characterize each matched column with $1,588$ features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F$_1$ score of $0.89$, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
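The support-weighted F1 score reported above can be illustrated with a short sketch. It assumes the usual definition: compute F1 for each semantic type, then average those scores weighted by the number of true columns of each type (its support). The function name and the toy labels are hypothetical and are not taken from the Sherlock code base.

```python
from collections import Counter


def support_weighted_f1(y_true, y_pred):
    """Average per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight by the class's share of columns
    return score


# Toy labels (hypothetical): six columns across three semantic types.
y_true = ["city", "city", "city", "name", "name", "age"]
y_pred = ["city", "city", "name", "name", "name", "age"]
print(round(support_weighted_f1(y_true, y_pred), 3))  # prints 0.833
```

Under this definition, frequent types dominate the score, so a model can score well while performing poorly on rare types. If scikit-learn is available, `f1_score(y_true, y_pred, average="weighted")` should compute the same quantity.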

Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, Çağatay Demiralp, César Hidalgo • 2019

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Column Type Annotation | PublicBI to GitTables | SW F1: 52.3 | 32 |
| Column Type Annotation | Semtab low-resource 2019 | SW F1: 31.3 | 26 |
| Semantic Type Classification | VizNet (test) | Support-weighted F1: 0.89 | 22 |
| Column Type Prediction | VizNet | Support-weighted F1: 86.7 | 11 |
| Column Type Annotation | PublicBI to VizNet 25% (3745 col) | SW F1 Score: 71.4 | 10 |
| Column Type Annotation | PublicBI to VizNet (50% (7490 col)) | SW F1: 75.9 | 10 |
| Column Type Annotation | PublicBI to VizNet 100% (14980 col) | SW F1: 79.1 | 10 |
| Column Type Annotation | VizNet to Semtab 25% (1363 col) 2019 | SW F1: 49.4 | 10 |
| Column Type Annotation | VizNet to Semtab2019 50% (2725 col) | SW F1: 56.7 | 10 |
| Column Type Annotation | VizNet to Semtab 2019 (100% (5450 col)) | SW F1: 0.637 | 10 |
Showing 10 of 21 rows
