Prototype Guided Post-pretraining for Single-Cell Representation Learning
About
Single-cell representation learning (SCRL) from gene expression data offers a way to uncover the complex regulatory logic underlying cellular function. Inspired by large language models in natural language modeling, several single-cell pretrained models have recently been proposed that treat genes as tokens and cells as sentences. However, these models are fundamentally limited by the long-tailed nature of cell-type distributions and struggle to generalize under covariate shifts in gene expression data. While fine-tuning is often used to mitigate these issues, we observe that the resulting performance remains bounded. To address this challenge, we introduce CellRefine, a post-pretraining method that operates between the pretraining and fine-tuning stages of a single-cell foundation model. CellRefine uses a multi-faceted objective that incorporates marker-gene sets as structural priors to guide post-pretraining and refine the latent embedding manifold of cells. Across multiple computational biology tasks, empirical results show that CellRefine consistently improves downstream performance, yielding gains of up to 15%.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| On-domain cell identity prediction | Blood | Macro F1 | 84 | 10 |
| On-domain cell identity prediction | Pancreas | Macro F1 | 83 | 10 |
| On-domain cell identity prediction | Myeloid | Macro F1 | 39 | 10 |
| On-domain cell identity prediction | Liver | Macro F1 | 38 | 10 |
| On-domain cell identity prediction | LivST1 | Macro F1 | 77 | 10 |
| On-domain cell identity prediction | LivST2 | Macro F1 | 73 | 10 |
| On-domain cell identity prediction | MS | Macro F1 | 75 | 10 |
| On-domain cell identity prediction | Lung | Macro F1 | 97 | 10 |
| On-domain cell identity prediction | PBMC10k | Macro F1 | 96 | 10 |
| On-domain cell identity prediction | Heart | Macro F1 | 91 | 10 |
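
The benchmarks above report macro F1, which averages per-class F1 scores without weighting by class size, so rare cell types count as much as abundant ones. The following is a minimal sketch of that metric in plain Python, not the evaluation code behind these results; the function name `macro_f1` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical long-tailed toy data: 8 cells of a common type, 2 of a rare type.
y_true = ["T cell"] * 8 + ["pDC"] * 2
# A classifier that always predicts the common type.
y_pred = ["T cell"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

In this toy case the classifier is right on 8 of 10 cells, yet the rare class contributes an F1 of zero and pulls the macro average well below the accuracy. This is why macro F1 is a natural choice when cell-type distributions are long-tailed.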