
TEDDY: A Family Of Foundation Models For Understanding Single Cell Biology

About

Understanding the biological mechanisms of disease is crucial for medicine, and in particular for drug discovery. AI-powered analysis of genome-scale biological data holds great potential in this regard. The increasing availability of single-cell RNA sequencing data has enabled the development of large foundation models for disease biology. However, existing foundation models improve only modestly over task-specific models in downstream applications. Here, we explored two avenues for improving single-cell foundation models. First, we scaled the pre-training data to a diverse collection of 116 million cells, which is larger than those used by previous models. Second, we leveraged the availability of large-scale biological annotations as a form of supervision during pre-training. We trained the TEDDY family of models, comprising six transformer-based state-of-the-art single-cell foundation models with 70 million, 160 million, and 400 million parameters. We vetted our models on several downstream evaluation tasks, including identifying the underlying disease state of held-out donors not seen during training, distinguishing between diseased and healthy cells for disease conditions and donors not seen during training, and probing the learned representations for known biology. Our models showed substantial improvement over existing work, and scaling experiments showed that performance improved predictably with both data volume and parameter count.

Alexis Chevalier, Soumya Ghosh, Urvi Awasthi, James Watkins, Julia Bieniewska, Nichita Mitrea, Olga Kotova, Kirill Shkura, Andrew Noble, Michael Steinbaugh, Vijay Sadashivaiah, George Dasoulas, Julien Delile, Christoph Meier, Leonid Zhukov, Iya Khalil, Srayanta Mukherjee, Judith Mueller • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Disease Classification | Pediatric Crohn’s Disease | Accuracy 74 | 8 |
| Classification | donors (held-out) | Accuracy 72 | 5 |
| Disease Classification | CKD (held-out) | Accuracy 94 | 5 |
| Disease Classification | Rheumatoid Arthritis (held-out) | Accuracy 56 | 5 |
| In-silico single-gene perturbation | Heart cell atlas (236 fetal cardiomyocyte transcriptomes) | pHK (Direct) 9.45 | 5 |
| Disease Classification | Alzheimers (held-out) | Accuracy 75 | 5 |
| Disease Classification | Gastric Cancer (held-out) | Accuracy 56 | 5 |
| Disease Classification | Chronic Kidney Disease | Accuracy 95 | 3 |
| Disease Classification | Alzheimer’s Disease | Accuracy 86 | 3 |
| Disease Classification | Gastric cancer | Accuracy 70 | 3 |
Showing 10 of 13 rows
