Data Language Models: A New Foundation Model Class for Tabular Data

About

Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. Every approach to tabular AI today, from gradient-boosted trees to the latest tabular foundation models, requires a preprocessing pipeline before any model can consume the data. None of them understand tabular data as a modality. We introduce the Data Language Model (DLM), the missing foundation model for tabular data. A DLM understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values. It is the tabular data layer on which AI models, agents, and vertical AI applications can be built, eliminating the preprocessing pipelines that currently stand between raw data and every AI system that consumes it. We present Schema-1, the first DLM: a 140M-parameter model trained on more than 2.3M synthetic and real-world tabular datasets. Schema-1 outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models we evaluate on established row-level prediction benchmarks. On missing-value reconstruction, it achieves lower mean reconstruction error across conditions than all classical statistical methods and frontier large language models, establishing that structural understanding of a dataset's own distributional geometry is more useful for imputation than world knowledge encoded in language. It identifies the industry sector of any unseen dataset from raw cell values alone, reliably across any domain, a task no prior tabular model can perform. It is the native tabular understanding layer that has been missing from the AI stack.

Eda Erol, Giuliano Pezzoli, Ozer Cem Kelahmet • 2026

Related benchmarks

Task	Dataset	Metric	Result
Tabular Imputation	20 real-world datasets MNAR 2026 (test)	NRMSE (5% Missing Rate)	0.146	12
Tabular Imputation	20 real-world datasets MCAR 2026 (test)	NRMSE (5% Missing Data)	0.118	12
Tabular Imputation	MAR 2026 (test)	NRMSE @ 5%	0.128	12
Tabular Imputation	20 Real-world Datasets Overall 2026 (test)	Mean NRMSE	0.163	12
Classification	OpenML CC18	Mean ROC-AUC	98.49	2
Classification with missing data	OpenML-CC18 0–70% missing data	Mean ROC-AUC	0.9196	2
Column-agnostic Classification	OpenML-CC18 no column names	ROC-AUC	93.18	2
Imputation	OpenML CC18	NRMSE	0.163	2
Sector classification	Blind	Top-1 Accuracy	91.4	2
Sequential fine-tuning	OpenML-CC18 Sequential	Retention Rate	97.8	2
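
The imputation rows above report NRMSE at a 5% missing rate. This page does not say how the error is normalized, so the following is only a minimal sketch of how such a score might be computed. It assumes each column's error is divided by that column's standard deviation, uses MCAR masking (every cell hidden independently with the same probability), and uses column-mean imputation as a stand-in baseline. None of this is Schema-1 or the authors' evaluation code, and the helper names (`mcar_mask`, `mean_impute`, `nrmse`) are hypothetical.

```python
import numpy as np

def mcar_mask(shape, missing_rate, rng):
    """Boolean mask, True where a cell is hidden.
    Each cell is dropped independently with the same probability (MCAR)."""
    return rng.random(shape) < missing_rate

def mean_impute(x_observed):
    """Fill NaNs with the per-column mean of the observed values (classical baseline)."""
    col_means = np.nanmean(x_observed, axis=0)
    return np.where(np.isnan(x_observed), col_means, x_observed)

def nrmse(x_true, x_imputed, mask):
    """RMSE over the hidden cells only, with each column's error divided by
    that column's standard deviation (an assumed normalization)."""
    col_std = x_true.std(axis=0)
    col_std = np.where(col_std == 0, 1.0, col_std)  # guard against constant columns
    err = (x_imputed - x_true) / col_std
    return float(np.sqrt(np.mean(err[mask] ** 2)))

# Synthetic numeric table: 1000 rows, 3 columns on very different scales.
rng = np.random.default_rng(0)
x_true = rng.normal(loc=[0.0, 10.0, -5.0], scale=[1.0, 3.0, 0.5], size=(1000, 3))

mask = mcar_mask(x_true.shape, 0.05, rng)    # hide roughly 5% of cells
x_observed = np.where(mask, np.nan, x_true)  # what an imputer would see
x_imputed = mean_impute(x_observed)

score = nrmse(x_true, x_imputed, mask)
print(f"NRMSE at 5% MCAR, mean imputation: {score:.3f}")
```

Under this normalization, mean imputation scores close to 1.0 on the hidden cells, and a perfect reconstruction scores 0. A different normalization (for example by column range) would put the numbers on a different scale, so scores are only comparable when the convention matches.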
