
Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals

About

We report the discovery and extraction of a compact hematopoietic algorithm from the single-cell foundation model scGPT, to our knowledge the first biologically useful, competitive algorithm extracted from a foundation model via mechanistic interpretability. We show that scGPT internally encodes a compact hematopoietic manifold with significant developmental branch structure, validated on a strict non-overlap Tabula Sapiens external panel and confirmed via frozen-head zero-shot transfer to an independent multi-donor immune panel. To isolate this geometry, we introduce a general three-stage extraction method consisting of direct operator export from frozen attention weights, a lightweight learned adaptor, and a task-specific readout, producing a standalone algorithm without target-dataset retraining. In 88-split donor-holdout benchmarks against scVI, Palantir, DPT, CellTypist, PCA, and raw-expression baselines, the extracted algorithm achieves the strongest pseudotime-depth ordering and leads on key subtype endpoints (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951). Compared to standard probing of frozen scGPT embeddings with a 3-layer MLP, the extracted head is BH-significantly better on 6/8 classification endpoints while completing a full 12-split evaluation campaign 34.5x faster with approximately 1000x fewer trainable parameters. The exported operator compresses from three pooled attention heads to a single head without statistically significant loss, and further to a rank-64 surrogate. Mechanistic interpretability of the compact operator reveals a concentrated four-factor core explaining 66.2% of ablation impact, with factors resolving into explicit T/lymphoid, B/plasma, granulocytic, and monocyte/macrophage gene programs. A supplementary second-manifold validation (intercellular communication geometry) confirms that the extraction method generalizes beyond hematopoiesis.
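The "BH-significantly better" claim above refers to the Benjamini-Hochberg procedure, which controls the false discovery rate when several endpoints are tested at once. The sketch below is a minimal pure-Python illustration of that step-up rule, not the author's evaluation code. The function name `benjamini_hochberg`, the `alpha=0.05` level, and the eight p-values are assumptions made up for illustration; none of them come from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null is rejected at FDR level alpha.

    Illustrative sketch of the Benjamini-Hochberg step-up rule.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis ranked at or below max_k, even if its own
    # p-value missed its own threshold (this is the "step-up" property).
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for 8 endpoints (made up; NOT results from the paper).
example_p = [0.001, 0.008, 0.020, 0.021, 0.022, 0.040, 0.300, 0.700]
significant = benjamini_hochberg(example_p, alpha=0.05)
print(sum(significant), "of", len(example_p), "endpoints rejected")
```

In this made-up example the third p-value (0.020) exceeds its own rank threshold (3/8 × 0.05 = 0.01875) but is still rejected, because a larger-ranked p-value passes its threshold. That is how the step-up rule differs from a per-test cutoff.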

Ihor Kendiukhov • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Branch classification | Tabula Sapiens Large-cell Robust-V2 (pooled donor-holdout) | Branch Balanced Accuracy | 82.8 | 10 |
| Cell Type Discrimination | Tabula Sapiens Large-cell Robust-V2 (pooled donor-holdout) | CD4/CD8 AUROC | 0.867 | 10 |
| Pseudotime Estimation | Tabula Sapiens Large-cell Robust-V2 (pooled donor-holdout) | Pseudotime Spearman Correlation | 0.249 | 10 |
| Stage classification | Tabula Sapiens Large-cell Robust-V2 (pooled donor-holdout) | Stage Balanced Accuracy | 55.2 | 10 |
| Pseudotime Estimation | Robust V2 (pooled donor-holdout) | Mean Difference | -0.092 | 9 |
| Unsupervised branch-inference | Canonical (100k) | ARI | 61.3 | 8 |
| Unsupervised branch-inference | Canonical 4 held-out donors (100k split) | ARI | 0.387 | 8 |
| Stage classification | Robust cells V2 (test) | Balanced Accuracy | 55.2 | 4 |
| Branch classification | Robust V2 (test) | Balanced Accuracy | 82.8 | 2 |
| CD4/CD8 identification | Robust cells V2 (test) | AUROC | 0.867 | 2 |
Showing 10 of 11 rows
