
LakeHopper: Cross Data Lakes Column Type Annotation through Model Adaptation

About

Column type annotation is vital for tasks like data cleaning, integration, and visualization. Recent solutions rely on resource-intensive language models (LMs) fine-tuned on well-annotated columns from a particular set of tables, i.e., a source data lake. In this paper, we study whether an existing pre-trained LM-based model can be adapted to a new (i.e., target) data lake so as to minimize the annotations required on the new data lake. Three challenges arise: bridging the source-target knowledge gap, selecting informative target data, and fine-tuning without losing shared knowledge. We propose LakeHopper, a framework that identifies and resolves the knowledge gap through LM interactions, employs a cluster-based data selection scheme for unannotated columns, and uses an incremental fine-tuning mechanism that gradually adapts the source model to the target data lake. Our experimental results validate the effectiveness of LakeHopper on two different data lake transfers under both low-resource and high-resource settings.
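
To make the idea of cluster-based data selection more concrete, here is a minimal sketch of one common way such a scheme can work: cluster the embeddings of unannotated columns and send the column nearest each cluster centre for annotation. This is an illustration under stated assumptions, not LakeHopper's actual procedure; the abstract does not specify the clustering algorithm or the selection rule. The names `kmeans`, `select_for_annotation`, and `budget` are hypothetical, and the embeddings are assumed to be plain lists of floats produced by some encoder.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over a list of equal-length float vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        assign = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    # Final assignment so labels match the returned centroids.
    return centroids, [nearest(p, centroids) for p in points]


def select_for_annotation(embeddings, budget, seed=0):
    """Hypothetical selection rule: one representative column per cluster.

    Returns the indices of at most `budget` columns, each being the
    column closest to the centre of its cluster.
    """
    k = min(budget, len(embeddings))
    centroids, assign = kmeans(embeddings, k, seed=seed)
    chosen = []
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            chosen.append(min(members, key=lambda i: math.dist(embeddings[i], centroid)))
    return sorted(chosen)


# Toy usage: six column embeddings forming two well-separated groups.
toy_embeddings = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
picked = select_for_annotation(toy_embeddings, budget=2)
```

Picking one representative per cluster spreads a small annotation budget across dissimilar columns instead of spending it on near-duplicates; other rules, such as picking the most uncertain column per cluster, would fit the same skeleton.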

Yushi Sun, Xujia Li, Nan Tang, Quanqing Xu, Chuanhui Yang, Lei Chen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Column Type Annotation | PublicBI to GitTables | SW F1 70.3 | 32 |
| Column Type Annotation | Semtab low-resource 2019 | SW F1 62.4 | 26 |
| Column Type Annotation | VizNet | -- | 11 |
| Column Type Annotation | Semtab 2019 (test) | -- | 11 |
| Column Type Annotation | PublicBI to VizNet 25% (3745 col) | SW F1 Score 87.6 | 10 |
| Column Type Annotation | PublicBI to VizNet (50% (7490 col)) | SW F1 90.4 | 10 |
| Column Type Annotation | PublicBI to VizNet 100% (14980 col) | SW F1 92.9 | 10 |
| Column Type Annotation | VizNet to Semtab 25% (1363 col) 2019 | SW F1 75.7 | 10 |
| Column Type Annotation | VizNet to Semtab2019 50% (2725 col) | SW F1 79.9 | 10 |
| Column Type Annotation | VizNet to Semtab 2019 (100% (5450 col)) | SW F1 0.82 | 10 |
Showing 10 of 14 rows
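
The results above are reported as "SW F1". Assuming this stands for support-weighted F1 (the page does not expand the abbreviation), the sketch below shows how such a score is typically computed: per-class F1 averaged with weights proportional to how often each class appears in the ground truth. The function name and the toy labels are hypothetical. Note that this function returns a value between 0 and 1; most rows above appear to use a 0 to 100 scale, while the 0.82 entry appears to use the 0 to 1 scale.

```python
from collections import Counter


def support_weighted_f1(y_true, y_pred):
    """Per-class F1, weighted by each class's share of the true labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Classes with more true examples contribute more to the total.
        score += (n / total) * f1
    return score


# Toy usage: four columns, two semantic types (labels are made up).
y_true = ["city", "city", "city", "year"]
y_pred = ["city", "city", "year", "year"]
toy_score = support_weighted_f1(y_true, y_pred)
```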
