TabPFN-Wide: Continued Pre-Training for Extreme Feature Counts

About

Revealing novel insights from the relationship between molecular measurements and pathology remains a very impactful application of machine learning in biomedicine. Data in this domain typically contain only a few observations but thousands of potentially noisy features, posing challenges for conventional tabular machine learning approaches. While prior-data fitted networks emerge as foundation models for predictive tabular data tasks, they are currently not suited to handle large feature counts (>500). Although feature reduction enables their application, it hinders feature importance analysis. We propose a strategy that extends existing models through continued pre-training on synthetic data sampled from a customized prior. The resulting model, TabPFN-Wide, matches or exceeds its base model's performance, while exhibiting improved robustness to noise. It seamlessly scales beyond 30,000 categorical and continuous features, regardless of noise levels, while maintaining inherent interpretability, which is critical for biomedical applications. Our results demonstrate that prior-informed adaptation is suitable to enhance the capability of foundation models for high-dimensional data. On real-world omics datasets, we show that many of the most relevant features identified by the model overlap with previous biological findings, while others propose potential starting points for future studies.
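To make the feature-reduction point concrete, here is a minimal sketch in plain Python. It is not the authors' method or code: the toy data generator, the variance-based selection rule, and all function names are assumptions chosen for illustration. Only the regime (few observations, thousands of noisy features) and the roughly 500-feature limit come from the abstract.

```python
import random
import statistics


def make_wide_dataset(n_samples=40, n_features=2000, n_informative=5, seed=0):
    """Toy 'few samples, many noisy features' data (purely synthetic, hypothetical)."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]
    rows = []
    for y in labels:
        # Every feature starts as standard Gaussian noise.
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        # Shift the first n_informative features by the class label.
        for j in range(n_informative):
            row[j] += 2.0 * y
        rows.append(row)
    return rows, labels


def top_variance_features(rows, k):
    """Unsupervised reduction: keep the indices of the k highest-variance features."""
    n_features = len(rows[0])
    variances = [
        statistics.pvariance([r[j] for r in rows]) for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


rows, labels = make_wide_dataset()
# Reduce to 500 features, the limit the abstract cites for existing models.
kept = top_variance_features(rows, k=500)
dropped = set(range(len(rows[0]))) - set(kept)
print(f"kept {len(kept)} features, dropped {len(dropped)}")
```

Any importance analysis run after this step can only score the features in `kept`; the features in `dropped` never reach the model, so nothing can be said about their relevance. A model that accepts all features directly avoids that blind spot.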

Christopher Kolberg, Jules Kreuer, Jonas Huurdeman, Sofiane Ouaari, Katharina Eggensperger, Nico Pfeifer • 2025

Related benchmarks

Task                                    Dataset            Result                   Rank
Molecular and Clinical Classification   CPTAC COAD         AUROC 95.9               11
Multiomics Classification               BRCA Multiomics    AUROC 0.978              11
Multiomics Classification               GBM Multiomics     AUROC 0.965              11
Multiomics Classification               OV Multiomics      AUROC 98.7               11
Multiomics Classification               GBM                Average Accuracy 81.6    11
Multiomics Classification               LGG                Average Accuracy 97.6    11
Multiomics Classification               OV                 Average Accuracy 88.7    11
Multiomics Classification               LGG Multiomics     AUROC 98.8               11
Multiomics Classification               BRCA               Average Accuracy 86.4    11
Multiomics Classification               COAD               Average Accuracy 87.3    11
