FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

About

Structured data is widely used in domains such as healthcare, finance, and scientific data management. Recent studies on structured data foundation models (SFMs) aim to support data analysis and mining tasks over such data, but still face scalability and generalization challenges when applied to real-world enterprise databases. First, many SFMs rely on full self-attention, which introduces an O(N^2) computational bottleneck and limits the number of tuples that can be processed jointly. Second, directly replacing attention with linear-complexity sequence models may conflict with the permutation-invariant nature of structured data, introducing artificial order bias and degrading representation quality. Moreover, models trained only on synthetic data may struggle to generalize to the heavy-tailed and heterogeneous distributions commonly found in real-world databases. To address these challenges, we propose FEAT, a linear-complexity foundation model for extremely large structured data. FEAT replaces quadratic attention with a multi-layer dual-axis encoding architecture. It integrates an adaptive-fusion bidirectional state-space model (AFBM) with convolutional gated linear attention (Conv-GLA), enabling cross-tuple contextualization in O(N) time while supporting permutation-invariant representation learning. To improve robustness under real-world data skewness, FEAT further adopts a hybrid structural causal pre-training pipeline with a robust reconstruction objective. Experiments on 12 real-world database benchmarks show that FEAT consistently outperforms representative SFMs on zero-shot tasks and scales linearly with structured-data sample length, achieving up to 50x faster inference.

Zhenghang Song, Tang Qian, Lu Chen, Yushuai Li, Zhengke Hu, Bingbing Fang, Yumeng Song, Junbo Zhao, Sheng Zhang, Tianyi Li • 2026
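
The two scalability concerns raised in the abstract (quadratic attention cost and order bias from sequence models) can be illustrated with a small, self-contained sketch. This is not FEAT's AFBM or Conv-GLA, whose details are not given on this page. It is a generic non-causal kernelized linear attention using a common positive feature map (`elu(x) + 1`), contrasted with a simple unidirectional recurrence. The function names `phi`, `linear_attention`, `causal_scan`, and `max_abs_diff`, the decay value, and the toy data are all assumptions made for illustration.

```python
import math
import random


def phi(x):
    # Positive feature map elu(x) + 1 (an illustrative choice, not FEAT's).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Non-causal kernelized attention; cost grows linearly with the row count N."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * d                       # running sum of phi(k_j)
    for k, v in zip(K, V):              # one pass over the N rows
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:                         # second pass over the N rows
        fq = phi(q)
        norm = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / norm
                    for b in range(dv)])
    return out


def causal_scan(X, decay=0.5):
    """Unidirectional recurrence h_t = decay * h_{t-1} + x_t (order-dependent)."""
    h = [0.0] * len(X[0])
    out = []
    for x in X:
        h = [decay * hi + xi for hi, xi in zip(h, x)]
        out.append(h)
    return out


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


random.seed(0)
N, d = 6, 3
X = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]
perm = list(reversed(range(N)))         # reorder the tuples
Xp = [X[i] for i in perm]

att, att_p = linear_attention(X, X, X), linear_attention(Xp, Xp, Xp)
scan, scan_p = causal_scan(X), causal_scan(Xp)

# Each row's attention output follows the row when the table is reordered;
# the causal scan's per-row output changes with the ordering.
equiv_err = max_abs_diff(att_p, [att[i] for i in perm])
scan_err = max_abs_diff(scan_p, [scan[i] for i in perm])
print("linear attention reorder error:", equiv_err)
print("causal scan reorder error:", scan_err)
```

Because the summaries `S` and `z` are plain sums over rows, they do not depend on row order, which is why the attention output is unaffected by reordering up to floating-point rounding. The unidirectional scan has no such property, which is the kind of order bias the abstract says a naive swap to a sequence model can introduce.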

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Inference | Scalability and Efficiency Evaluation D=20 (test) | Inference Latency (ms) | 149.2 | 26 |
| Regression | GI-REG | RMSE | 0.4703 | 10 |
| Regression | BCCO-REG | RMSE | 0.406 | 10 |
| Classification | GI-CLS | AUC | 0.8991 | 9 |
| Classification | Tabarena CLS | AUC | 0.8638 | 9 |
| Classification | Tabzilla CLS | AUC | 92.51 | 9 |
| Regression | CTR23-REG | RMSE | 0.4053 | 9 |
| Regression | PFN REG | RMSE | 0.5257 | 9 |
| Classification | BCCO-CLS | AUC | 85.79 | 9 |
| Regression | Talent-REG | RMSE | 0.4708 | 9 |
Showing 10 of 12 rows
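
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of their standard definitions. RMSE is the root of the mean squared error and AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative. The function names and toy inputs are hypothetical, and this is not the evaluation code behind these benchmarks. The page does not state whether regression targets were normalized. AUC entries such as 92.51 are presumably the same quantity expressed as a percentage, though the page does not say so.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error; lower is better."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def auc(y_true, scores):
    """Binary ROC AUC via pairwise comparison; ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy inputs (hypothetical, not taken from any benchmark above).
toy_rmse = rmse([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_rmse, toy_auc)
```

The pairwise AUC loop is quadratic in the number of samples and is meant only to make the definition explicit. A rank-based implementation would be preferable for large test sets.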
