Prototype Guided Post-pretraining for Single-Cell Representation Learning

About

Single-cell representation learning (SCRL) from gene expression data offers a way to uncover the complex regulatory logic underlying cellular function. Inspired by large language models in natural language modeling, several single-cell pretrained models have recently been proposed that treat genes as tokens and cells as sentences. However, these models are fundamentally limited by the long-tailed nature of cell-type distributions and struggle to generalize under covariate shifts in gene expression data. While fine-tuning is often used to mitigate these issues, we observe that performance remains bounded. To address this challenge, we introduce CellRefine, a post-pretraining method that operates between the pretraining and fine-tuning stages of a single-cell foundation model. CellRefine uses a multi-faceted objective that incorporates marker-gene sets as structural priors to guide post-pretraining and refine the latent embedding manifold of cells. Across multiple computational biology tasks, empirical results show that CellRefine consistently improves downstream performance, yielding gains of up to 15%.

Sachini Weerasekara, Natasha Darras, Sagar Kamarthi, Colles Price, Jacqueline Isaacs • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| On-domain cell identity prediction | blood | Macro F1 Score: 84 | 10 |
| On-domain cell identity prediction | Pancreas | Macro F1: 83 | 10 |
| On-domain cell identity prediction | Myeloid | Macro F1: 39 | 10 |
| On-domain cell identity prediction | Liver | Macro F1: 38 | 10 |
| On-domain cell identity prediction | LivST1 | Macro F1: 77 | 10 |
| On-domain cell identity prediction | LivST2 | Macro F1: 73 | 10 |
| On-domain cell identity prediction | MS | Macro F1 Score: 75 | 10 |
| On-domain cell identity prediction | Lung | Macro F1 Score: 97 | 10 |
| On-domain cell identity prediction | PBMC10k | Macro F1 Score: 96 | 10 |
| On-domain cell identity prediction | Heart | Macro F1: 91 | 10 |
Showing 10 of 13 rows
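
The benchmark results are reported as Macro F1, which averages per-class F1 with equal weight for every cell type, so rare types in a long-tailed distribution count as much as abundant ones. The following is a minimal sketch of that metric in plain Python. It is not taken from the paper's evaluation code; the function name `macro_f1` and the toy labels are illustrative assumptions, and the function returns a value in [0, 1], whereas the scores listed here appear to be on a 0 to 100 scale.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # predicted class p, but it was wrong
            fn[t] += 1          # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical labels): a classifier that ignores the rare type.
y_true = ["T cell"] * 8 + ["rare type"] * 2
y_pred = ["T cell"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case accuracy stays high because the abundant class dominates, while Macro F1 drops sharply because the rare class contributes an F1 of zero.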
