Segment-driven Structural Induction and Semantic Alignment for Heterogeneous Tabular Representation
About
Real-world domains often contain heterogeneous tables whose headers vary even though their underlying attribute semantics are shared, which makes it difficult to induce domain-specialized semantics from table-local evidence alone. Existing encoders address parts of this problem, but they often underuse column-level value distributions and apply uniform objectives to attributes with different semantic roles. We propose NAVI, a segment-centric pretraining framework that treats each header-value pair as the unit for aggregating schema-level structural evidence and column-level distributional evidence. We realize this design through Masked Segment Modeling and Entropy-driven Segment Alignment, which jointly enforce structured header-value coupling and semantic alignment across stable and instance-specific attributes. Experiments on heterogeneous in-domain tables show improved reconstruction, semantic consistency, and downstream utility across evaluation settings.
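The abstract does not spell out how entropy separates stable from instance-specific attributes, so the following is only an illustrative sketch and not NAVI's actual procedure. It assumes that an attribute's role can be approximated by the Shannon entropy of its column's empirical value distribution. The function names and the `threshold` value are hypothetical.

```python
import math
from collections import Counter


def column_entropy(values):
    """Shannon entropy (in bits) of the empirical value distribution of a column."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(values):
    """Entropy divided by log2(n), its maximum for n values; 0.0 when n <= 1."""
    n = len(values)
    if n <= 1:
        return 0.0
    return column_entropy(values) / math.log2(n)


def split_attributes(table, threshold=0.5):
    """Split headers into low-entropy and high-entropy groups.

    `table` maps each header to the list of values in its column.
    `threshold` is a hypothetical cut-off, not a value from the paper.
    """
    stable, instance_specific = [], []
    for header, values in table.items():
        if normalized_entropy(values) < threshold:
            stable.append(header)
        else:
            instance_specific.append(header)
    return stable, instance_specific


# Toy product table: "brand" repeats across rows, "sku" is unique per row.
table = {
    "brand": ["Acme", "Acme", "Acme", "Acme"],
    "sku": ["A-1", "A-2", "A-3", "A-4"],
}
stable, instance_specific = split_attributes(table)
```

Under this assumption, a column with few repeated values (such as a brand) lands in the stable group, while a column with one distinct value per row (such as an identifier) lands in the instance-specific group.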
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Header Prediction | Product (test) | Accuracy | 99.97 | 16 |
| Header Prediction | Movie (test) | Accuracy | 99.98 | 16 |
| Row Classification | Product (test) | Macro-F1 (XGBoost) | 94.4 | 11 |
| Row Classification | Movie (test) | Macro-F1 (XGBoost) | 62.9 | 11 |
| Header Prediction | Product | Accuracy | 99.95 | 7 |
| Header Prediction | Movie | Accuracy | 99.98 | 7 |
| Value Imputation | Product | Accuracy | 79.77 | 7 |
| Value Imputation | Movie | Accuracy | 70.77 | 7 |
| Header Clustering | Product domain | NMI | 90.05 | 4 |
| Header Clustering | Movie domain | NMI | 91.44 | 4 |
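
For readers unfamiliar with the metrics in the table, the sketch below shows the standard definitions of Macro-F1 and normalized mutual information (NMI) in plain Python. The page does not say how the reported scores were computed, so two things here are assumptions: the arithmetic-mean normalization for NMI (other variants exist) and the function names. The functions return values in [0, 1], whereas the table appears to report scores on a 0 to 100 scale.

```python
import math
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def _entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    mi = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denom = (_entropy(labels_a) + _entropy(labels_b)) / 2
    # Both assignments are a single cluster: treat them as identical.
    return mi / denom if denom > 0 else 1.0


f1 = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
score = nmi([0, 0, 1, 1], [1, 1, 0, 0])
```

NMI is invariant to how cluster ids are named, which is why the second example, whose two assignments differ only by swapping labels, is a perfect match.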